Abstract

We study the communication over Finite State Channels (FSCs), where the encoder and the decoder can control the availability 
or the quality of the noise-free feedback. Specifically, the instantaneous feedback is a function of an action taken by the encoder, 
an action taken by the decoder, and the channel output. Encoder and decoder actions take values in finite alphabets, and may be 
subject to average cost constraints. 

We prove capacity results for such a setting by constructing a sequence of achievable rates, using a simple scheme based on 
'code tree' generation, that generates channel input symbols along with encoder and decoder actions. We prove that the limit of 
this sequence exists. For a given block length $N$ and probability of error $\epsilon$, we give an upper bound on the maximum achievable
rate. Our upper and lower bounds coincide and hence yield the capacity for the case where the probability of initial state is positive 
for all states. Further, for stationary indecomposable channels without intersymbol interference (ISI), the capacity is given as the 
limit of normalized directed information between the input and output sequence, maximized over an appropriate set of causally 
conditioned distributions. As an important special case, we consider the framework of 'to feed or not to feed back' where either 
the encoder or the decoder takes binary actions, which determine whether current channel output will be fed back to the encoder, 
with a constraint on the fraction of channel outputs that are fed back. As another special case of our framework, we characterize 
the capacity of 'coding on the backward link' in FSCs, i.e. when the decoder sends limited-rate instantaneous coded noise-free 
feedback on the backward link. Finally, we propose an extension of the Blahut-Arimoto algorithm for evaluating the capacity 
when actions can be cost constrained, and demonstrate its application on a few examples. 

Index Terms 

Actions, Blahut-Arimoto Algorithm, Causal Conditioning, Channel with States, Cost Constraints, Directed Information,
Feedback Sampling, Indecomposable Channel, Intersymbol Interference, Sampled Feedback, Time-invariant Deterministic Feedback,
To Feed or Not to Feed Back.



I. Introduction 

Feedback plays a very important role in communication systems. Despite proving a pessimistic result in [1] that feedback does
not increase the capacity of a memoryless channel, Shannon did foresee the important role of feedback, which he highlighted
in the first Shannon Lecture. Indeed, even for memoryless channels, feedback has its merits, such as simple capacity-achieving
coding schemes and improved reliability [2], [3]. Feedback is also known to increase the capacity of multiple-access channels
[4] and broadcast channels [5], [6].

In his book [7], Gallager introduced finite state channels (FSCs) as an apt model for a very broad family of channels with
memory. When no feedback is present and the channel is stationary and indecomposable without ISI, the capacity was shown
by Gallager in [7] and by Blackwell, Breiman and Thomasian in [8] to be

\[ C_{NF} = \lim_{N \to \infty} \frac{1}{N} \max_{P(x^N)} I(X^N; Y^N). \tag{1} \]

For the case of no ISI, stationary and indecomposable finite state channels with time-invariant deterministic feedback, the
capacity was shown in [9] to be

\[ C_{FB} = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1})} I(X^N \to Y^N), \tag{2} \]

where $Q(x^N \| z^{N-1})$ is the causal conditioning introduced by Kramer in [10], [11] and is defined as

\[ Q(x^N \| z^{N-1}) = \prod_{i=1}^{N} Q(x_i \mid x^{i-1}, z^{i-1}). \tag{3} \]

Here $z_i$ is a time-invariant deterministic function of the output $y_i$. Subsequent work on FSCs included the characterization
of the capacity of the finite state multiple access channel in [12]. When the channels have memory, feedback can increase the
capacity even for single user channels. One such example is the chemical channel introduced in 1961 by Blackwell in [13],
also referred to as the 'trapdoor channel' by Ash in [14]. The capacity of this channel without feedback is a long-standing
open problem with only bounds on it known, such as those established by Kobayashi et al. in [15], [16]. With feedback, the
capacity of the trapdoor channel was computed in [17] using a dynamic programming approach and shown to be strictly higher
than the capacity without feedback. For Gaussian channels with memory, Cover and Pombra in [18] showed feedback cannot
increase the capacity of an additive white Gaussian channel by more than half of a bit. Kim characterized the capacity of a
wide class of stationary Gaussian channels with feedback in [19].

Directed information, denoted by $I(X^N \to Y^N)$, was introduced by Massey in [20], where he credits it to Marko [21].
It was further shown by Massey and Massey in [22] that directed information equals mutual information for memoryless
channels if and only if there is no feedback. Directed information also appears in the work of Tatikonda et al. [23], [24],
which generalizes the work of Verdú and Han in [25] to channels with feedback. The capacity of some Markovian channels
was computed using directed information by Yang et al. in [26] and by Chen and Berger in [27]. Tatikonda also formulated the
problem of computing capacities of channels with feedback as a Markov Decision Process in [28]. Zero-error capacity was also
computed using dynamic programming in [29]. Recently, interpretations of directed information in gambling, portfolio theory
and estimation have been characterized in [30], [31] and [32]. The capacity of the compound channel with feedback was
computed in [33] using directed information. Directed information also appeared in rate distortion problems, such as source
coding with feed-forward by Pradhan and Venkataramanan [34], and implicitly in the competitive prediction framework of [35].

In [36], the notion of actions in a source coding context was introduced. Their setting is a generalization of the Wyner-Ziv
source coding with decoder side information problem in [37], where now the decoder can take actions based on the index
obtained from the encoder to affect the formation or availability of side information. In [38], the channel coding dual is studied
where the transmitter takes actions that affect the formation of channel states. This framework captures various new coding
scenarios which include two stage recording on a memory with defects, motivated by similar problems in magnetic recording
and computer memories. Kittichokechai et al. in [39] studied a variant of the problem in [36] and [38], where encoder and
decoder both have action dependent partial side information. However, in the source coding formulation of [36], attention was
restricted to the case where the actions are taken by the decoder, while in the channel coding scenario of [38] and [39], actions
were taken only by the encoder. Recently, in [40], the channel coding setting in [38] and [39] was generalized to accommodate
the case where both the encoder and the decoder take channel probing actions, with associated costs, to maximize the rate of
reliable communication. This was referred to as the 'Probing Capacity'.

In this paper, we introduce the notion of actions in acquisition of noise-free feedback or its deterministic function for 
FSCs. The main contribution of this paper is in characterizing the cost-capacity trade-off when the feedback observed by the 
encoder is a deterministic function of an action taken by the encoder, an action taken by the decoder, and the channel output, 
when actions are required to satisfy an average cost constraint. More precisely, the encoder observes 'sampled' feedback
$Z_i = f(A_{e,i}, A_{d,i}, Y_i)$, where $f(\cdot)$ is a deterministic function, $Y_i$ is the channel output, $A_{e,i} = A_{e,i}(M, Z^{i-1})$ is the action
taken by the encoder as a function of the message and the past sampled feedback, and $A_{d,i}$ is the action taken by the decoder,
where we study two scenarios: one where that action is strictly causal in the channel output, i.e., $A_{d,i} = A_{d,i}(Y^{i-1})$, and
one where it can depend also on the present channel output, i.e., $A_{d,i} = A_{d,i}(Y^i)$. The problem is motivated by practical
applications where acquisition of the feedback may be costly, and either or both the encoder and decoder influence whether 
and what from the channel output is to be fed back. 

The key technique in our achievability result lies in generating both action and input symbol code trees, as described in
Section IV. With this achievability scheme, most of the proof follows that in [9], except for some cases where care has to
be taken to properly handle cost constraints. This is because the presence of cost constraints breaks
some properties that were used in [9], such as sub-additivity. The main contribution of our paper is in obtaining a multi-letter
characterization of the capacity for our communication scenario, involving maximization over directed information. In
order to numerically evaluate the capacity when actions are cost constrained, we also propose a Blahut-Arimoto type algorithm
[41], [42], similar to that proposed in [43], where the objective was to maximize the multi-letter directed information expression.
Our characterization of capacity also admits a dynamic optimization formulation that can lead to analytic closed form capacity
expressions for specific channels, similarly as in [17], [44], though pursuing it is not part of this work.

A special case of our framework is when only the encoder or the decoder is the one taking actions. Under this setting, we 
motivate and compute a special case of to feed or not to feed back, i.e., where actions are binary corresponding to observing 
the channel output or not observing it, the cost constraint corresponding to the fraction of channel output observations allowed, 
and the channel states evolve as a Markov chain independent of the channel input process. When only the encoder takes
actions, we derive a single letter lower bound on this capacity and show that it is strictly better than the rate achieved by a
naive time sharing scheme between capacity at zero cost (corresponding to Gallager's capacity for FSCs in [7]) and unit cost
(corresponding to the complete noise-free feedback capacity of [9]). In contrast to this analytical lower bound, our algorithm
(cf. BAA-Action, Section XI) provides a series of upper and lower bounds which converge to the actual capacity (when it
exists). For the same FSC, we also derive bounds on the capacity when only the decoder takes binary actions. A special case 
of the framework when only decoder takes actions is that of coding on the backward link, where the decoder sends a symbol 
from the action alphabet based on the channel outputs observed so far, thus operating at an instantaneous rate which is log the 
cardinality of said alphabet. The capacity for this case is characterized in single letter form for some Markovian Channels. 

The rest of the paper is organized as follows. Section II describes the channel model and formulates the problem studied in this
paper. The main results of this paper are outlined in Section III. Section IV is dedicated to capacity-achieving coding schemes,
while converse results are proved in Section V. Section VI characterizes the capacity for stationary, indecomposable finite state
channels without intersymbol interference (ISI). Section VII generalizes the framework from the decoder taking actions strictly
causally dependent on the channel output (i.e., $A_{d,i} = A_{d,i}(Y^{i-1})$) to the case when the decoder can also use the current output
to generate its actions (i.e., $A_{d,i} = A_{d,i}(Y^i)$). As special cases, Section VIII-A outlines the capacity results when actions are
taken only by the encoder, while the case when only the decoder takes actions is discussed in Section VIII-B. Section IX presents
single letter lower bounds for a specific example of to feed or not to feed back (i.e., when actions are binary) for Markovian
channels when only one of the two, encoder or decoder, takes the actions. Section X establishes that coding on the backward
link for FSCs is a special case of our general framework, and computes the capacity for an example of a Markovian channel.
Section XI presents a Blahut-Arimoto type algorithm (BAA-Action) to find a series of converging upper and lower bounds for
the case when the encoder takes actions which are cost constrained. The paper is summarized and concluded in Section XII.



II. Channel Model and Problem Formulation 

We begin by introducing the notation used throughout this paper. Let upper case, lower case, and calligraphic letters denote, 
respectively, random variables, specific or deterministic values they may assume, and their alphabets. For two jointly distributed 
random variables $X$ and $Y$, let $P_X$, $P_{XY}$ and $P_{X|Y}$ respectively denote the marginal of $X$, the joint distribution of $(X, Y)$ and the
conditional distribution of $X$ given $Y$. $X_m^n$ is a shorthand for the $(n - m + 1)$-tuple $\{X_m, X_{m+1}, \ldots, X_{n-1}, X_n\}$. $X^n$ will also
denote $X_1^n$. When $i \le 0$, $X^i$ denotes the null string, as does $X_i^j$ when $i > j$. $X^{n \setminus i}$ denotes $\{X_1, \ldots, X_{i-1}, X_{i+1}, \ldots, X_n\}$.
The cardinality of an alphabet $\mathcal{X}$ is denoted by $|\mathcal{X}|$. We impose the assumption of finiteness of cardinality on all alphabets,
unless otherwise indicated. 

We use the Causal Conditioning notation $(\cdot \| \cdot)$ as introduced by Kramer in [10] and [11]:

\[ P(y^N \| x^N) = \prod_{i=1}^{N} P(y_i \mid x^i, y^{i-1}). \tag{4} \]

We also use the following notation as introduced in [9]:

\[ P(y^N \| x^{N-1}) = \prod_{i=1}^{N} P(y_i \mid x^{i-1}, y^{i-1}). \tag{5} \]

Note that both causal conditionings, $P(y^N \| x^N)$ and $P(y^N \| x^{N-1})$, are distributions on $\mathcal{Y}^N$ for a fixed $x^N$, as they are
nonnegative for all $x^N$, $y^N$ and they sum to unity, i.e.,

\[ \sum_{y^N} P(y^N \| x^N) = \sum_{y^N} P(y^N \| x^{N-1}) = 1. \tag{6} \]

The directed information $I(X^N \to Y^N)$, as defined by Massey in [20], is given by

\[ I(X^N \to Y^N) = \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}) = \mathbb{E}\left[ \log \frac{P(Y^N \| X^N)}{P(Y^N)} \right], \tag{7} \]

where $\mathbb{E}$ stands for expectation. Naturally, the directed information conditioned on a random object $S$, $I(X^N \to Y^N \mid S)$, is
defined as

\[ I(X^N \to Y^N \mid S) = \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}, S). \tag{8} \]
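The sum form of the directed information can be evaluated directly for small block lengths. The following is a minimal sketch, not part of the original development: the helper names (`marginal`, `directed_information`, `bsc_joint`) and the toy joint distribution (i.i.d. uniform inputs over a memoryless binary symmetric channel, no feedback) are illustrative assumptions.

```python
import itertools
import math
from collections import defaultdict


def marginal(joint, nx, ny):
    """Marginal pmf of (x^nx, y^ny) from a joint pmf over (x^N, y^N)."""
    out = defaultdict(float)
    for (x, y), p in joint.items():
        out[(x[:nx], y[:ny])] += p
    return out


def directed_information(joint, n):
    """I(X^N -> Y^N) = sum_i I(X^i; Y_i | Y^{i-1}), in bits."""
    total = 0.0
    for i in range(1, n + 1):
        p_xy = marginal(joint, i, i)          # P(x^i, y^i)
        p_xyprev = marginal(joint, i, i - 1)  # P(x^i, y^{i-1})
        p_y = marginal(joint, 0, i)           # P(y^i)
        p_yprev = marginal(joint, 0, i - 1)   # P(y^{i-1})
        for (x, y), p in p_xy.items():
            if p > 0:
                num = p * p_yprev[((), y[:-1])]
                den = p_xyprev[(x, y[:-1])] * p_y[((), y)]
                total += p * math.log2(num / den)
    return total


def bsc_joint(n, eps):
    """Hypothetical toy joint pmf: i.i.d. uniform inputs, memoryless BSC(eps)."""
    joint = {}
    for x in itertools.product((0, 1), repeat=n):
        for y in itertools.product((0, 1), repeat=n):
            p = 0.5 ** n
            for xi, yi in zip(x, y):
                p *= (1 - eps) if xi == yi else eps
            joint[(x, y)] = p
    return joint
```

For this toy case there is no feedback, so the directed information should coincide with the mutual information $N(1 - h(\epsilon))$, where $h$ is the binary entropy function; this gives a simple sanity check of the implementation.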



We model discrete-time channels with memory as Finite State Channels (FSCs), introduced by Gallager in his book [7] as
an apt class of models for channels with memory, e.g., channels with ISI. The channel input symbols take values in the
finite alphabet $\mathcal{X}$ and the output, denoted by $Y$, takes values in the finite alphabet $\mathcal{Y}$. The state takes values in a finite alphabet $\mathcal{S}$. The
stationary channel is characterized by the conditional probability law $P(y_i, s_i \mid x_i, s_{i-1})$ satisfying

\[ P(y_i, s_i \mid x^i, s^{i-1}, y^{i-1}, m) = P(y_i, s_i \mid x_i, s_{i-1}), \tag{9} \]

and by the probability of the initial state $P(s_0)$. More precisely, without loss of generality, we can make the following
assumption on our channel model,

\[ P(y_i, s_i \mid x^i, s^{i-1}, y^{i-1}, a_e^i, a_d^i, m) = P(y_i, s_i \mid x_i, s_{i-1}), \tag{10} \]

where $a_{e,i} \in \mathcal{A}_e$ and $a_{d,i} \in \mathcal{A}_d$ are the encoder and decoder actions respectively, as will be explained later. Messages $M \in \mathcal{M}$
are assumed to be independent of the initial state $s_0$. The FSC is without intersymbol interference (ISI) if

\[ P(s_i \mid s_{i-1}, x_i) = P(s_i \mid s_{i-1}), \tag{11} \]
Fig. 1. Modeling Feedback Sampling for the acquisition of feedback in Finite State Channels (FSCs).



i.e., the evolution of the channel states is independent of the channel input process. The basic framework in this paper is the 
setting depicted in Fig. 1. The communication system has the following building blocks:

• Encoder Feedback Logic: Generates encoder actions, $\{A_{e,i}\}_{i=1}^{N}$, using the functions $f_{A_e,i} : \mathcal{M} \times \mathcal{Z}^{i-1} \to \mathcal{A}_e$, i.e.,
$A_{e,i} = f_{A_e,i}(M, Z^{i-1})$, where $Z_i \in \mathcal{Z}$ is the sampled feedback component.

• Decoder Feedback Logic: Generates decoder actions, $\{A_{d,i}\}_{i=1}^{N}$, using the functions $f_{A_d,i} : \mathcal{Y}^{i-1} \to \mathcal{A}_d$, i.e.,
$A_{d,i} = f_{A_d,i}(Y^{i-1})$, where $Y_i \in \mathcal{Y}$ is the channel output.

• Feedback Sampler: Generates sampled feedback, $Z_i = f(A_{e,i}, A_{d,i}, Y_i)$, where $f$ is a deterministic function.

• Channel Encoder: Constructs the channel input symbol, $X_i(M, Z^{i-1})$, using the encoding function $f_{e,i} : \mathcal{M} \times \mathcal{Z}^{i-1} \to \mathcal{X}$.

• Channel Decoder: Generates the best estimate of the message given the channel output, $\hat{M}(Y^N)$, using the decoding
function $f_d : \mathcal{Y}^N \to \mathcal{M}$.
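To make the interaction of these building blocks concrete, the following is a minimal simulation sketch under stated assumptions. The two-state channel without ISI, its parameters `FLIP` and `EPS`, the equality-based sampler (the form used later for the 'to feed or not to feed back' illustration), the unit cost per fed-back output, and the placeholder encoder are all hypothetical choices made for illustration; they are not taken from the paper.

```python
import random

FLIP = 0.1                # assumed P(s_i != s_{i-1}); independent of the input (no ISI)
EPS = {0: 0.05, 1: 0.4}   # assumed BSC crossover probability in state 0 and state 1


def channel_step(x, s_prev, rng):
    """Draw (y_i, s_i) from the toy kernel P(y_i, s_i | x_i, s_{i-1})."""
    y = x ^ (rng.random() < EPS[s_prev])
    s = s_prev ^ (rng.random() < FLIP)
    return y, s


def feedback_sampler(a_e, a_d, y):
    """z_i = f(a_e, a_d, y): the output if the actions agree, an erasure otherwise."""
    return y if a_e == a_d else "*"


def cost(a_e, a_d):
    """Assumed cost function: one unit whenever an output is actually fed back."""
    return 1 if a_e == a_d else 0


def run_block(n, message_bits, encoder_action, rng, s0=0):
    """Simulate n channel uses; return outputs, sampled feedback, average cost."""
    s, ys, zs, total = s0, [], [], 0
    for i in range(n):
        a_e = encoder_action(i, message_bits, zs)  # function of (m, z^{i-1})
        a_d = 0                                    # decoder action, held constant here
        x = message_bits[i % len(message_bits)]    # placeholder encoder, ignores feedback
        y, s = channel_step(x, s, rng)
        zs.append(feedback_sampler(a_e, a_d, y))
        ys.append(y)
        total += cost(a_e, a_d)
    return ys, zs, total / n
```

With the encoder action `lambda i, m, z: i % 2`, every other output is fed back, so the empirical average cost is exactly one half for an even block length.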

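We are interested in characterizing the maximal rate of reliable communication under the average cost constraint,

\[ \mathbb{E}\left[\Lambda(A_e^N, A_d^N)\right] = \mathbb{E}\left[\frac{1}{N} \sum_{i=1}^{N} \Lambda(A_{e,i}, A_{d,i})\right] \le \Gamma, \tag{12} \]

where $\Lambda(\cdot, \cdot)$ is a given cost function satisfying $\max_{a_e \in \mathcal{A}_e, a_d \in \mathcal{A}_d} \Lambda(a_e, a_d) = \Lambda_{\max} < \infty$.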
The joint probability distribution induced by a given scheme is

\[ P(m, a_e^N, a_d^N, z^N, x^N, s_0^N, y^N, \hat{m}) = \frac{1}{|\mathcal{M}|} P(s_0) \prod_{i=1}^{N} 1_{\{a_{d,i} = f_{A_d,i}(y^{i-1})\}} 1_{\{a_{e,i} = f_{A_e,i}(m, z^{i-1})\}} 1_{\{x_i = f_{e,i}(m, z^{i-1})\}} P(y_i, s_i \mid x_i, s_{i-1}) 1_{\{z_i = f(a_{e,i}, a_{d,i}, y_i)\}} \times 1_{\{\hat{m} = f_d(y^N)\}}. \tag{13} \]



Definition 1: A rate $R$ is said to be achievable if there exists a sequence of block codes $(N, \lceil 2^{NR} \rceil)$ satisfying (12) such
that the maximal probability of error,

\[ \max_{m \in \{1, \ldots, \lceil 2^{NR} \rceil\}} \Pr(\hat{m} \neq m \mid \text{message } m \text{ was sent}), \]

vanishes as $N \to \infty$. The capacity of such a system is denoted by $C$, which is the supremum of all achievable rates.



III. Main Results 
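Let $s_0$ denote the initial state. We define $\underline{C}_N(\Gamma)$ and $\overline{C}_N(\Gamma)$ as

\[ \underline{C}_N(\Gamma) = \frac{1}{N} \max \min_{s_0} I(X^N \to Y^N \mid s_0) \tag{14} \]
\[ \overline{C}_N(\Gamma) = \frac{1}{N} \max \max_{s_0} I(X^N \to Y^N \mid s_0). \tag{15} \]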
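Here max denotes maximization over the joint probability distribution,

\[ P(s_0, x^N, a_e^N, a_d^N, y^N, z^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0) \prod_{i=1}^{N} 1_{\{z_i = f(a_{e,i}, a_{d,i}, y_i)\}}, \]

such that $\mathbb{E}[\Lambda(A_e^N, A_d^N)] \le \Gamma$, where

\[ Q(x^N, a_e^N \| z^{N-1}) = \prod_{i=1}^{N} Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1}) \tag{16} \]
\[ Q(a_d^N \| y^{N-1}) = \prod_{i=1}^{N} Q(a_{d,i} \mid a_d^{i-1}, y^{i-1}) \tag{17} \]
\[ I(X^N \to Y^N \mid s_0) = \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}, s_0) \tag{18} \]
\[ = \mathbb{E}\left[ \log \frac{P(Y^N \| X^N, s_0)}{P(Y^N \mid s_0)} \right] \tag{19} \]
\[ P(Y^N \| X^N, s_0) = \prod_{i=1}^{N} P(y_i \mid x^i, y^{i-1}, s_0). \tag{20} \]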



As $z^N$ is a deterministic function of $(a_e^N, a_d^N, y^N)$, from now on we will consider maximization over the joint probability
distribution,

\[ P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0), \tag{21} \]

where $z_i$ will stand for $f(a_{e,i}, a_{d,i}, y_i)$ unless otherwise stated. Note that effectively the maximization in the definitions of $\underline{C}_N(\Gamma)$ and
$\overline{C}_N(\Gamma)$ is over $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$, as $P(s_0)$ is fixed and $P(y^N \| x^N, s_0)$ (and likewise $P(y^N \| x^N)$) is a
characteristic of the channel given by (Lemma 6 of [9]),

\[ P(y^N \| x^N, s_0) = \sum_{s_1^N} \prod_{i=1}^{N} P(y_i, s_i \mid x_i, s_{i-1}) \tag{22} \]
\[ P(y^N \| x^N) = \sum_{s_0} P(s_0)\, P(y^N \| x^N, s_0) = \sum_{s_0^N} P(s_0) \left( \prod_{i=1}^{N} P(y_i, s_i \mid x_i, s_{i-1}) \right). \tag{23} \]

Our main results are as follows:

• Achievable Rate: For a communication abstraction as in Fig. 1, any rate $R$ is achievable such that

\[ R < \lim_{N \to \infty} \underline{C}_N(\Gamma) = \sup_{N} \left[ \underline{C}_N(\Gamma) - \frac{\log |\mathcal{S}|}{N} \right]. \tag{24} \]

• Converse: Consider a coding scheme with rate $R$ which achieves reliable communication over the FSC with feedback
sampling as in Fig. 1. This implies the existence of $(N, \lceil 2^{NR} \rceil)$ codes such that the probability of error $P_e^{(N)}$ goes to zero
as $N \to \infty$. For such a scheme, given $\epsilon > 0$, there exists a block length $N_0$ such that for all block lengths $N > N_0$ we have

\[ R \le \overline{C}_N(\Gamma) + \epsilon. \tag{25} \]

• Capacity: In the following cases we characterize the capacity exactly.
1) For an FSC where the probability of the initial state is positive for all $s_0 \in \mathcal{S}$, the capacity is

\[ C(\Gamma) = \lim_{N \to \infty} \underline{C}_N(\Gamma). \tag{26} \]

2) For stationary indecomposable channels without ISI with feedback sampling as in Fig. 1, the capacity is

\[ C(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to Y^N), \tag{27} \]

where max denotes maximization over the joint probability distribution

\[ P(x^N, a_e^N, a_d^N, y^N) = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N), \tag{28} \]

such that $\mathbb{E}[\Lambda(A_e^N, A_d^N)] \le \Gamma$.
IV. Achievability

We begin this section by proving that the limit of the sequence $\underline{C}_N(\Gamma)$ exists. We then explain the encoding and decoding
scheme, followed by an analysis of the probability of error, showing that any rate $R$ is achievable such that $R < \underline{C}(\Gamma) =
\lim_{N \to \infty} \underline{C}_N(\Gamma)$. Encoding uses random code-tree generation while decoding uses maximum likelihood decoding as in [7].

A. Existence of $\underline{C}(\Gamma)$

By the following theorem, we prove the existence of the limit of the sequence $\underline{C}_N(\Gamma)$.
Theorem 1: For a finite state channel with $|\mathcal{S}|$ states, $\lim_{N \to \infty} \underline{C}_N(\Gamma)$ exists and

\[ \lim_{N \to \infty} \underline{C}_N(\Gamma) = \sup_{N} \left[ \underline{C}_N(\Gamma) - \frac{\log |\mathcal{S}|}{N} \right]. \tag{29} \]

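Proof: Let $N = n + l$, with $n, l \in \mathbb{Z}_+$. Note that in Section IV-B we will show that we achieve $\underline{C}_N$ by using random
coding with a distribution of the form $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$ satisfying the cost constraints. Let us assume that $\underline{C}_n(\Gamma)$
and $\underline{C}_l(\Gamma)$ are achieved by $Q(x^n, a_e^n \| z^{n-1})\, Q(a_d^n \| y^{n-1})$ and $Q(x^l, a_e^l \| z^{l-1})\, Q(a_d^l \| y^{l-1})$ respectively.
Consider the product distributions in which the length-$l$ factor is applied to the last $l$ symbols of the block,

\[ Q(x^N, a_e^N \| z^{N-1}) = Q(x^n, a_e^n \| z^{n-1})\, Q(x^l, a_e^l \| z^{l-1}) \tag{30} \]
\[ Q(a_d^N \| y^{N-1}) = Q(a_d^n \| y^{n-1})\, Q(a_d^l \| y^{l-1}). \tag{31} \]

Therefore

\[ \mathbb{E}[\Lambda(A_e^N, A_d^N)] = \frac{n}{N}\, \mathbb{E}[\Lambda(A_e^n, A_d^n)] + \frac{l}{N}\, \mathbb{E}[\Lambda(A_e^l, A_d^l)] \tag{32} \]
\[ \le \frac{n\Gamma + l\Gamma}{n + l} = \Gamma. \tag{33} \]

Hence $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$ (which is a distribution) satisfies the cost requirements, but it may not be capacity
achieving for blocklength $N$, so

\[ N \underline{C}_N(\Gamma) \ge \min_{s_0} I(X^N \to Y^N \mid s_0). \tag{34} \]

We now follow the steps in the proof of Theorem 8 in [9] to arrive at

\[ N \left[ \underline{C}_N(\Gamma) - \frac{\log |\mathcal{S}|}{N} \right] \ge n \left[ \underline{C}_n(\Gamma) - \frac{\log |\mathcal{S}|}{n} \right] + l \left[ \underline{C}_l(\Gamma) - \frac{\log |\mathcal{S}|}{l} \right]. \tag{35} \]

Hence the sequence $n[\underline{C}_n(\Gamma) - \log |\mathcal{S}| / n]$ is superadditive in $n \in \mathbb{Z}_+$. The theorem then follows from the convergence of
superadditive sequences, as in Theorem 4.6.1 of [7]. ■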
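The last step of the proof relies on a general fact about superadditive sequences: if $a_{n+l} \ge a_n + a_l$ for all $n, l$, then $a_n / n$ converges to $\sup_n a_n / n$. The sketch below illustrates this numerically on a toy sequence, $a_n = c\,n - \sqrt{n}$, which is an assumption chosen only because its limit is known in closed form; it is not the capacity sequence of any channel.

```python
import math


def a(n, c=1.0):
    """Toy superadditive sequence a_n = c*n - sqrt(n) (illustrative assumption)."""
    return c * n - math.sqrt(n)


def is_superadditive(seq, upto):
    """Check seq(n + l) >= seq(n) + seq(l) for all 1 <= n, l < upto."""
    return all(seq(n + l) >= seq(n) + seq(l) - 1e-12
               for n in range(1, upto) for l in range(1, upto))


def normalized(seq, n):
    """The normalized term a_n / n whose limit equals its supremum."""
    return seq(n) / n
```

Here `normalized(a, n)` equals $c - 1/\sqrt{n}$, which increases to the supremum $c$ without ever exceeding it. In the theorem the role of $a_N$ is played by $N \underline{C}_N(\Gamma) - \log |\mathcal{S}|$.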



B. Encoding Scheme 

Encoding is based on generating separate code trees, as described below. These are then revealed to the encoder and
the decoder.

• Encoder Code-Tree: $2^{NR}$ code-trees are generated as follows: the $i$-th encoder action and channel input symbol are
generated using a probability mass function which depends on the previous encoder actions and channel input symbols and
on the past sampled feedback sequence, i.e., $Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1})$.

• Decoder Action Code-Tree: We generate a single code tree at random, where each vertex represents a decoder action symbol
$a_{d,i}$ generated with distribution $Q(a_{d,i} \mid a_d^{i-1}, y^{i-1})$. Thus the present decoder action depends on the past actions as well
as the past channel outputs.

Note that $\{Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1})\}_{i=1}^{N}$ and $\{Q(a_{d,i} \mid a_d^{i-1}, y^{i-1})\}_{i=1}^{N}$ correspond to the joint distribution on
$(X^N, A_e^N, A_d^N, S^N, Y^N)$ such that the constraint $\mathbb{E}[\Lambda(A_e^N, A_d^N)] \le \Gamma$ is satisfied.

Fig. 2 illustrates the Encoder Code-Tree for a specific example. The setting on the right of the figure illustrates
to feed or not to feed back, when the output alphabet is binary and

\[ Z_i = f(A_{e,i}, A_{d,i}, Y_i) = * \quad \text{if } A_{e,i} \neq A_{d,i}, \tag{36} \]
\[ Z_i = f(A_{e,i}, A_{d,i}, Y_i) = Y_i \quad \text{if } A_{e,i} = A_{d,i}, \tag{37} \]

where $*$ stands for erasure or no feedback. Knowing past channel outputs, the decoder uses the Decoder Action Code-Tree to figure
out the decoder action symbol. Using the decoder action symbol $a_{d,i}$, along with the encoder action $a_{e,i}$ and the channel output
$y_i$, the feedback sampler produces sampled feedback as $z_i = f(a_{e,i}, a_{d,i}, y_i)$. In this way, given a message $m$ and the complete
sampled feedback sequence $z^{N-1}$ thus obtained, there is a particular $(x^N, a_e^N)$ which can be found from the collection of
encoder code trees. The encoder thus sends the corresponding $x^N$ through the channel. Note that our coding scheme is similar
Fig. 2. This figure illustrates Encoder Code-Trees in our coding scheme. The left-hand side depicts a general setting where $\mathcal{Z} = \{a, b, c\}$ and
$z_i = f(a_{e,i}, a_{d,i}, y_i)$. The tree is shown for $N = 3$. The right-hand side shows a specific example where $a_{d,i} = 0\ \forall i$ and the output is binary. The actions of the
encoder are $a_{e,i} \in \{0, 1\}$ and $z_i = f(a_{e,i}, a_{d,i}, y_i) = y_i$ if $a_{e,i} = a_{d,i}$ (equivalently, $a_{e,i} = 0$); otherwise it is an erasure ($= *$). Hence some portion of the tree collapses, since by
knowing $a_{e,i}$ we know the possible values of $z_i$; e.g., $a_{e,i} = 1$ implies $z_i = *$ and $a_{e,i} = 0$ implies $z_i = 0$ or $1$.

in spirit to the code tree generation scheme in [9]. However, here we generate both the cost constrained encoder actions
and channel input symbols in one tree while decoder actions are generated in another tree.
By the above code tree generation, we have in our achievability scheme,

\[ P(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, a_d^{i-1}, y^{i-1}, s_0) = P(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, a_d^{i-1}, y^{i-1}, z^{i-1}, s_0) \tag{38} \]
\[ = Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1}), \tag{39} \]

where the first equality follows from the fact that $z_i = f(a_{e,i}, a_{d,i}, y_i)$, while the second equality is due to our coding scheme where
the $i$-th input and encoder action symbol only depend on past input symbols, actions and sampled feedback. Similarly, since the
$i$-th decoder action only depends on past decoder actions and channel outputs, we have

\[ P(a_{d,i} \mid a_d^{i-1}, y^{i-1}, x^i, a_e^i, s_0) = Q(a_{d,i} \mid a_d^{i-1}, y^{i-1}). \tag{40} \]

Lemma 1: The joint probability distribution on $(s_0, x^N, a_e^N, a_d^N, y^N)$ induced by the achievability scheme described above is

\[ P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0). \tag{41} \]

Proof: Using Property 1 in Appendix A we have

\[ P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0)\, P(y^N \| x^N, a_e^N, a_d^N, s_0). \tag{42} \]

From the definition of causal conditioning and using Eqs. (39) and (40) we have

\[ Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0) = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}). \tag{43} \]
Now, again using Eqs. (39) and (40) and the channel model assumption in Eq. (10), consider

\[ P(s_0^N, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}) \prod_{i=1}^{N} P(y_i, s_i \mid x_i, s_{i-1}). \tag{44} \]

Summing over $s_1^N$, and using the characterization of $P(y^N \| x^N, s_0)$ in Section III as

\[ P(y^N \| x^N, s_0) = \sum_{s_1^N} \prod_{i=1}^{N} P(y_i, s_i \mid x_i, s_{i-1}), \tag{45} \]

we obtain

\[ P(s_0, x^N, a_e^N, a_d^N, y^N) = \sum_{s_1^N} P(s_0^N, x^N, a_e^N, a_d^N, y^N) \tag{46} \]
\[ = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0). \tag{47} \]

■

Corollary 1: From the steps in the previous lemma it immediately follows that

\[ P(y^N \| x^N, a_e^N, a_d^N, s_0) = P(y^N \| x^N, s_0). \tag{48} \]

Note that, likewise, it can also be shown as in Eq. (48) that

\[ P(y^N \| x^N, a_e^N, a_d^N) = P(y^N \| x^N), \tag{49} \]

which we will use in the next section on decoding.

C. Decoding 

The decoder performs ML decoding, i.e., it chooses the message $m$ for which $P(y^N \mid m)$ is maximized.

\[ P(y^N \mid m) = \prod_{i=1}^{N} P(y_i \mid y^{i-1}, m) \tag{50} \]
\[ \overset{(a)}{=} \prod_{i=1}^{N} P\big(y_i \mid y^{i-1}, a_d^i(y^{i-1}), m, x^i(m, z^{i-1}), a_e^i(m, z^{i-1})\big) \tag{51} \]
\[ \overset{(b)}{=} \prod_{i=1}^{N} P\big(y_i \mid y^{i-1}, a_d^i(y^{i-1}), x^i(m, z^{i-1}), a_e^i(m, z^{i-1})\big) \tag{52} \]
\[ = P(y^N \| x^N, a_e^N, a_d^N) \tag{53} \]
\[ \overset{(c)}{=} P(y^N \| x^N), \tag{54} \]

where (a) follows from the fact that knowing $m$ and $y^{i-1}$ we know $(x^i, a_e^i, a_d^i)$. This can be shown iteratively. Given $m$ we know
$(x_1(m), a_{e,1}(m))$. We also know $a_{d,1}$. Given $y_1$, $z_1 = f(a_{e,1}, a_{d,1}, y_1)$. Hence now we know $(x_2(m, z_1), a_{e,2}(m, z_1), a_{d,2}(y_1))$.
Iteratively we can conclude that for a given message $m$ and true feedback sequence $y^{i-1}$, we can construct $(x^i, a_e^i, a_d^i)$
knowing the codebooks. (b) follows from the assumption on the channel model in Eq. (10) and (c) follows from Eq. (49). Hence
ML decoding to construct the message estimate $\hat{m}$ can also be done by maximizing the causal conditioning, i.e.,

\[ \hat{m} = \arg\max_{m} P(y^N \mid m) = \arg\max_{m} P(y^N \| x^N). \tag{55} \]



D. Calculation of Probability of Error 

We will see in this section that most of the proofs are similar to those in [9], with $Q(x^N \| z^{N-1})$ being replaced
with $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$. This is justified by our coding scheme, which uses a causally conditioned
distribution, $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$, and by the optimal decoding, which finds

\[ \arg\max_{m} P(y^N \mid m) = \arg\max_{m} P(y^N \| x^N). \tag{56} \]

Also from Lemma 1 we have

\[ P(x^N, a_e^N, a_d^N, y^N) = \sum_{s_0} P(s_0, x^N, a_e^N, a_d^N, y^N) \tag{57} \]
\[ = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}) \sum_{s_0} P(s_0)\, P(y^N \| x^N, s_0) \tag{58} \]
\[ = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N), \tag{59} \]

where the last equality follows from the characterization of $P(y^N \| x^N)$ in Section III. Due to the factorization in Eq. (59), which is similar
to the one in [9],

\[ P(x^N, y^N) = Q(x^N \| z^{N-1})\, P(y^N \| x^N), \tag{60} \]

we have parallelism in the proofs.

Note that from now on we will not state the cost constraint, i.e., $\mathbb{E}[\Lambda(A_e^N, A_d^N)] \le \Gamma$, explicitly in the maximizing
distribution. The distribution $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$ will be assumed to be one satisfying the cost constraints. Let
$P_{e,m}$ denote the probability of error of ML decoding when message $m$ was sent. Given message $m$, let $\mathcal{Y}_m^c$ denote the set of
outputs that cause an error in decoding $m$, i.e.,

\[ P_{e,m} = \sum_{y^N \in \mathcal{Y}_m^c} P(y^N \mid m). \tag{61} \]

Theorem 2: Let $M$ denote the total number of messages used in transmission and $\mathbb{E}(P_{e,m})$ denote the average probability
of error over this ensemble of codes. Then for any $\rho$, $0 < \rho \le 1$,

\[ \mathbb{E}(P_{e,m}) \le (M - 1)^{\rho} \sum_{y^N} \left[ \sum_{x^N, a_e^N, a_d^N} Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N)^{\frac{1}{1+\rho}} \right]^{1+\rho}. \tag{62} \]

Proof: Refer to Appendix B. ■
Let $P_{e,m}(s_0)$ denote the probability of error given that the initial state of the FSC was $s_0$ and the message $m$ was sent.

Theorem 3: Consider an FSC with feedback sampling (Fig. 1) having $|\mathcal{S}|$ states. For any positive integer $N$ and any positive
rate $R$, there exists an $(N, M)$ code for which, for all messages $m \in \{1, \ldots, \lceil 2^{NR} \rceil\}$, all initial states $s_0$ and all $\rho$, $0 \le \rho \le 1$,

\[ P_{e,m}(s_0) \le 4 |\mathcal{S}|\, 2^{-N[-\rho R + F_N(\rho)]}, \tag{63} \]

where

\[ F_N(\rho) = -\frac{\rho \log |\mathcal{S}|}{N} + \max_{Q(x^N, a_e^N \| z^{N-1}) Q(a_d^N \| y^{N-1})} \min_{s_0} E_{0,N}\big(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0\big) \tag{64} \]

and

\[ E_{0,N}\big(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0\big) = -\frac{1}{N} \log \sum_{y^N} \left[ \sum_{x^N, a_e^N, a_d^N} Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0)^{\frac{1}{1+\rho}} \right]^{1+\rho}. \tag{65} \]

Proof: The proof follows the steps in the proof of Theorem 10 in [9] once we have obtained the bound on $\mathbb{E}(P_{e,m})$ [Eq.
(62)] in Theorem 2. ■

Theorem 4: $E_{0,N}(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0)$ has the following properties:

\[ E_{0,N}\big(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0\big) \ge 0, \tag{66} \]
\[ \frac{1}{N} I(X^N \to Y^N \mid s_0) \ge \frac{\partial E_{0,N}\big(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0\big)}{\partial \rho} > 0, \tag{67} \]
\[ \frac{\partial^2 E_{0,N}\big(\rho, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), s_0\big)}{\partial \rho^2} \le 0, \tag{68} \]

where equality in Eq. (66) holds when $\rho = 0$, and equality holds on the left side of Eq. (67) when $\rho = 0$.

Proof: Omitted, as it is similar to the proof of Theorem 11 in [9] with $Q(x^N \| z^{N-1})$ replaced by $Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})$. ■
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Lemma 2: We have the following results for the convergence of Fm{p), 

lim F N (p) = Foo (p) = sup F N (p) , (69) 

for < p < 1. The convergence of -F/v(p) is uniform in p and F oa (p) is uniformly continuous for p e [0, 1]. 

Proof: Omitted. Proof similar to Lemma 13 in (9). ■ 
Theorem 5: For any FSC with feedback logic let

$$E_r(R) = \max_{0 \le \rho \le 1} \left[ F_\infty(\rho) - \rho R \right]. \tag{70}$$

Then for any $\epsilon > 0$, there exists $N(\epsilon)$ such that for $N \ge N(\epsilon)$ there exists an $(N, M)$ code such that for all $m$, $1 \le m \le M = \lceil 2^{NR} \rceil$, and all initial states,

$$P_{e,m}(s_0) \le 2^{-N[E_r(R) - \epsilon]}. \tag{71}$$

Proof: The proof is similar to that of Theorem 14 in [9], using Theorems 1, 2, 3, 4 and Lemma 2 above to conclude that for every $s_0$ there exists a $\rho^*$ such that $F_\infty(\rho^*) - \rho^* R > 0$, for all $R < \underline{C}(\Gamma)$. ■
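To make the shape of the exponent in Eq. (70) concrete, the following is a minimal sketch for the simplest special case only: a single-state memoryless channel without feedback or actions, where $E_{0,N}$ reduces to Gallager's single-letter $E_0(\rho, Q)$. The function names, the grid search over $\rho$, and the binary symmetric channel used as input are illustrative assumptions, not part of the construction above.

```python
import math

def gallager_e0(rho, q, W):
    """Gallager's E_0(rho, Q) in bits for a DMC.

    q[x] is the input pmf and W[x][y] = P(y | x). This is only the
    single-letter, single-state special case of E_{0,N}.
    """
    total = 0.0
    for y in range(len(W[0])):
        inner = sum(q[x] * W[x][y] ** (1.0 / (1.0 + rho)) for x in range(len(q)))
        total += inner ** (1.0 + rho)
    return -math.log2(total)

def random_coding_exponent(R, q, W, grid=1000):
    """Grid-search approximation of max over rho in [0, 1] of E_0(rho, Q) - rho * R."""
    return max(gallager_e0(k / grid, q, W) - (k / grid) * R for k in range(grid + 1))

# Hypothetical example: binary symmetric channel, crossover 0.1, uniform input.
bsc = [[0.9, 0.1], [0.1, 0.9]]
uniform = [0.5, 0.5]
exponent_low_rate = random_coding_exponent(0.2, uniform, bsc)        # rate below capacity
exponent_above_capacity = random_coding_exponent(0.6, uniform, bsc)  # rate above capacity
```

In this toy case the exponent is strictly positive for the rate below capacity and collapses to zero (attained at $\rho = 0$) for the rate above it, which mirrors the role of $\rho^*$ in the proof of Theorem 5.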

V. Converse 

In this section, we first prove some converse results. Later in this section, we show that for FSCs where the probability of the initial state is positive for all $s_0 \in \mathcal{S}$, the achievable rate and the upper bound coincide, and hence the capacity is given by $\underline{C}(\Gamma)$.

Theorem 6: Consider a coding scheme with rate $R$ which achieves reliable communication over the FSC with feedback sampling as in Fig. 1, meeting the average cost constraints, Eq. (12). For such a scheme, given any $\epsilon_N > 0$, there exists a block length $N_0$ such that for all block lengths $N > N_0$ we have

$$R \le \overline{C}_N(\Gamma) + \epsilon_N. \tag{72}$$

Proof: Let the message $M$ be chosen uniformly, so that each message has probability $2^{-NR}$. Then

$$NR = H(M) \tag{73}$$
$$\overset{(a)}{=} H(M \mid S_0) \tag{74}$$
$$= I(M; Y^N \mid S_0) + H(M \mid Y^N, S_0) \tag{75}$$
$$\le I(M; Y^N \mid S_0) + H(M \mid Y^N) \tag{76}$$
$$\overset{(b)}{\le} I(M; Y^N \mid S_0) + 1 + P_e^{(N)} NR \tag{77}$$
$$\overset{(c)}{=} \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, S_0) - H(Y_i \mid Y^{i-1}, X^i, A_e^i, A_d^i, M, S_0) + 1 + P_e^{(N)} NR \tag{78}$$
$$= \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, S_0) - H(Y_i \mid Y^{i-1}, X^i, S_0) + 1 + P_e^{(N)} NR \tag{79}$$
$$= \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}, S_0) + 1 + P_e^{(N)} NR \tag{80}$$
$$= I(X^N \to Y^N \mid S_0) + 1 + P_e^{(N)} NR \tag{81}$$
$$\le \max_{s_0} I(X^N \to Y^N \mid s_0) + 1 + P_e^{(N)} NR \tag{82}$$
$$\overset{(d)}{\le} \max \max_{s_0} I(X^N \to Y^N \mid s_0) + 1 + P_e^{(N)} NR \tag{83}$$
$$= N \overline{C}_N(\Gamma) + 1 + P_e^{(N)} NR, \tag{84}$$

where

• (a) follows from the independence of the message and the initial state.

• (b) follows from Fano's inequality.

• (c) follows from the proof of MCI in Appendix C.

• (d) has its first maximization over the joint probability distribution

$$P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0) Q(x^N, a_e^N \| z^{N-1}) Q(a_d^N \| y^{N-1}) P(y^N \| x^N, s_0), \tag{85}$$

which satisfies the expected cost constraints, $E[\Lambda(A_e^N, A_d^N)] \le \Gamma$. Eq. (85) follows from Lemma 5 in Appendix C.
Hence, for sufficiently large $N$ and any given $\epsilon_N > 0$, we have

$$R \le \overline{C}_N(\Gamma) + \epsilon_N. \tag{86}$$

Note that, unlike in [9], the limit of $\overline{C}_N(\Gamma)$ may not exist, because sub-additivity (like the one in Theorem 16 in [9]) breaks due to the presence of cost constraints. Hence for a general FSC we have the above converse result for a given block length $N$ and probability of error $\epsilon_N$. However, if the probability of the initial state is positive for all states of the FSC, then we have the exact capacity, as shown by the following theorem.
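The converse is driven by the sum of terms $I(X^i; Y_i \mid Y^{i-1})$, which defines the directed information. As a small numerical illustration, the sketch below evaluates that sum from an explicit joint pmf for a short block length. The function name, the dictionary representation of the joint pmf, and the two-use binary symmetric channel example are assumptions made for illustration; conditioning on a fixed initial state is left implicit.

```python
import math
from collections import defaultdict
from itertools import product

def directed_information(joint, n):
    """I(X^n -> Y^n) = sum_i I(X^i; Y_i | Y^{i-1}) in bits.

    joint maps (x_tuple, y_tuple) -> probability, both tuples of length n.
    """
    total = 0.0
    for i in range(1, n + 1):
        p_xi_yi = defaultdict(float)     # P(x^i, y^i)
        p_xi_yprev = defaultdict(float)  # P(x^i, y^{i-1})
        p_yi = defaultdict(float)        # P(y^i)
        p_yprev = defaultdict(float)     # P(y^{i-1})
        for (x, y), p in joint.items():
            p_xi_yi[(x[:i], y[:i])] += p
            p_xi_yprev[(x[:i], y[:i - 1])] += p
            p_yi[y[:i]] += p
            p_yprev[y[:i - 1]] += p
        for (xi, yi), p in p_xi_yi.items():
            if p > 0:
                num = p * p_yprev[yi[:-1]]
                den = p_xi_yprev[(xi, yi[:-1])] * p_yi[yi]
                total += p * math.log2(num / den)
    return total

# Hypothetical example: two uses of a BSC(0.1) with i.i.d. uniform inputs, no feedback.
eps = 0.1
joint = {}
for x in product((0, 1), repeat=2):
    for y in product((0, 1), repeat=2):
        prob = 0.25
        for xi, yi in zip(x, y):
            prob *= (1 - eps) if xi == yi else eps
        joint[(x, y)] = prob
di_two_uses = directed_information(joint, 2)
```

Without feedback the directed information equals the mutual information, so in this example the value is twice the single-use mutual information of the binary symmetric channel.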

Theorem 7: Consider an FSC with feedback logic where all the initial states $s_0 \in \mathcal{S}$ have positive probability. The capacity is $\underline{C}(\Gamma)$.

Proof: The proof is similar to that of Theorem 17 in [9], with changes in the equalities (c), (d) and (e) below. Let $P_e^{(N)}(s_0)$ denote the probability of error when the initial state is $s_0$. Since every initial state $s_0 \in \mathcal{S}$ can occur with nonzero probability, there exists a sequence of block codes $(N, \lceil 2^{NR} \rceil)$ with $P_e^{(N)}(s_0) \to 0$, $\forall s_0 \in \mathcal{S}$. Hence we have

$$NR = H(M) \tag{87}$$
$$\overset{(a)}{=} H(M \mid s_0) \tag{88}$$
$$= I(M; Y^N \mid s_0) + H(M \mid Y^N, s_0) \tag{89}$$
$$\overset{(b)}{\le} I(M; Y^N \mid s_0) + 1 + P_e^{(N)}(s_0) NR \tag{90}$$
$$\overset{(c)}{=} \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, s_0) - H(Y_i \mid Y^{i-1}, X^i, A_e^i, A_d^i, M, s_0) + 1 + P_e^{(N)}(s_0) NR \tag{91}$$
$$\overset{(d)}{=} \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, s_0) - H(Y_i \mid Y^{i-1}, X^i, A_e^i, A_d^i, s_0) + 1 + P_e^{(N)}(s_0) NR \tag{92}$$
$$\overset{(e)}{=} \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, s_0) - H(Y_i \mid Y^{i-1}, X^i, s_0) + 1 + P_e^{(N)}(s_0) NR \tag{93}$$
$$= \sum_{i=1}^{N} I(X^i; Y_i \mid Y^{i-1}, s_0) + 1 + P_e^{(N)}(s_0) NR \tag{94}$$
$$= I(X^N \to Y^N \mid s_0) + 1 + P_e^{(N)}(s_0) NR \tag{95}$$
$$\overset{(f)}{\le} \min_{s_0} I(X^N \to Y^N \mid s_0) + 1 + P_e^{(N)}(s_0) NR, \tag{96}$$

where

• (a) follows from the fact that the message $M$ is independent of the initial state $s_0$.

• (b) follows from Fano's inequality.

• (c) follows from arguments similar to those in Section IV-C.

• (d) follows from the assumption on the channel model as in Eq. (10).

• (e) follows from the proof of MCI in Appendix C.

• (f) follows from the fact that Eq. (95) is true for all $s_0 \in \mathcal{S}$.

Hence, since $P_e^{(N)}(s_0) \to 0$, $\forall s_0 \in \mathcal{S}$, we have

$$R \le \lim_{N \to \infty} \frac{1}{N} \max \min_{s_0} I(X^N \to Y^N \mid s_0), \tag{97}$$

where, due to Lemma 5 in Appendix C, the maximization is over the joint probability distribution

$$P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0) Q(x^N, a_e^N \| z^{N-1}) Q(a_d^N \| y^{N-1}) P(y^N \| x^N, s_0), \tag{98}$$

which satisfies the expected cost constraints, $E[\Lambda(A_e^N, A_d^N)] \le \Gamma$. Together with the achievability result of Section IV, this implies that the capacity is

$$C(\Gamma) = \underline{C}(\Gamma). \tag{99}$$
VI. Capacity for Stationary Indecomposable FSC without ISI

We now assume that the state transition is a separate Markov chain and does not depend on the input, i.e., $P(y_i, s_i \mid s_{i-1}, x_i) = P(s_i \mid s_{i-1}) P(y_i \mid s_i, s_{i-1}, x_i)$. Such a channel is said to have no ISI. We further assume that this channel is indecomposable, in the sense of the definition below.

Definition 2: An FSC without ISI is said to be indecomposable if, for every $\epsilon > 0$, there exists $N_0$ such that for all $N \ge N_0$,

$$|P(s_N \mid s_0) - P(s_N \mid s_0')| \le \epsilon \quad \forall\, s_N, s_0, s_0'. \tag{100}$$



A necessary and sufficient condition for a no-ISI FSC to be indecomposable [c.f. Theorem 4.6.3, Q] is that there exists a choice for the $n$-th state, say $s_n$, such that

$$q(s_n \mid s_0) > 0, \quad \forall s_0 \in \mathcal{S}. \tag{101}$$

Furthermore, if the channel is indecomposable, $n$ above can always be taken less than $2^{|\mathcal{S}|^2}$. This condition [Theorem 6.3.2, 0] also implies the existence of a unique steady-state stationary distribution $\pi(s)$, i.e.,

$$\lim_{N \to \infty} P(S_N = s \mid s_0) = \pi(s). \tag{102}$$

The channel is stationary if $P(s_0) = \pi(s_0)$.
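The condition in Eq. (101) and the limit in Eq. (102) are easy to check numerically for a small state chain. The sketch below is illustrative only: the helper names, the cap on the number of steps searched, the power-iteration approach, and the two-state transition matrix are all assumptions and do not come from the text.

```python
def mat_mul(A, B):
    """Multiply two square matrices given as lists of rows."""
    k = len(A)
    return [[sum(A[i][m] * B[m][j] for m in range(k)) for j in range(k)] for i in range(k)]

def is_indecomposable(P, max_steps=None):
    """Search for n such that some state is reachable with positive probability
    from every initial state in exactly n steps (the condition of Eq. (101)).

    P[i][j] = P(next state j | current state i). The theoretical bound on n is
    2^(|S|^2); the default cap below is only a practical choice.
    """
    k = len(P)
    if max_steps is None:
        max_steps = min(2 ** (k * k), 10000)
    Pn = [row[:] for row in P]
    for _ in range(max_steps):
        if any(all(Pn[i][j] > 0 for i in range(k)) for j in range(k)):
            return True
        Pn = mat_mul(Pn, P)
    return False

def stationary_distribution(P, iters=2000):
    """Approximate pi by repeatedly applying pi <- pi P, starting from uniform."""
    k = len(P)
    pi = [1.0 / k] * k
    for _ in range(iters):
        pi = [sum(pi[i] * P[i][j] for i in range(k)) for j in range(k)]
    return pi

# Hypothetical two-state chain.
P_example = [[0.9, 0.1], [0.2, 0.8]]
pi_example = stationary_distribution(P_example)
```

For this example chain the stationary distribution is $(2/3, 1/3)$, whereas a chain with an identity transition matrix fails the check, since no single state is reachable from every initial state.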

Theorem 8: For a stationary and indecomposable FSC without ISI and with communication abstraction as in Fig. 1, the capacity of the channel is given by

$$C(\Gamma) = \lim_{N \to \infty} C_N(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to Y^N), \tag{103}$$

where max denotes maximization over $Q(x^N, a_e^N \| z^{N-1}) Q(a_d^N \| y^{N-1})$ such that $E[\Lambda(A_e^N, A_d^N)] \le \Gamma$.

Proof: The proof is similar to the proof of Theorem 18 in [9] with $Q(x^N \| z^{N-1})$ replaced by $Q(x^N, a_e^N \| z^{N-1}) Q(a_d^N \| y^{N-1})$. ■

VII. Causal Action Encoding at Decoder 

In this section we generalize the framework in Fig. 1 so that decoder actions also depend on the current channel output, i.e., $A_{d,i} = f_{A_{d,i}}(Y^i)$. The setting is depicted in Fig. 3. Note that the capacity in this generalized setting can be strictly better than that in Fig. 1. To get an intuition for this, consider a Markovian channel, i.e., an FSC for which

$$P(y_i, s_i \mid x_i, s_{i-1}) = P(y_i \mid x_i, s_{i-1}) P(s_i \mid s_{i-1}). \tag{104}$$

The decoder knows the states along with the output on the fly and feeds back the effective output, $Y_{FB,i} = (Y_i, S_i)$, to the feedback sampler. The feedback sampling function is specialized to $f(A_{e,i}, A_{d,i}, Y_{FB,i}) = A_{d,i}$, with $A_{d,i} = f_{A_{d,i}}(Y^i, S^i)$. Further, $|\mathcal{A}| = |\mathcal{S}|$ and there are no cost constraints. We will see later in Section X that this is the setting of coding on the backward link in FSCs with no constraints on active feedback symbols. As will be shown in Section X, the capacity of this system is the same as that when encoder and decoder both have state information, and it is achieved by setting $A_{d,i} = S_i$. Here we are able to do better because $X_i(M, A^{i-1})$ can be generated using $S_{i-1}$, on which the channel output depends through $P(Y_i \mid X_i, S_{i-1})$. Thus, it is easy to see that for the setting in Fig. 1, i.e., when $A_{d,i} = f_{A_{d,i}}(Y^{i-1}, S^{i-1})$, the capacity can be strictly less, as the channel input can depend on the state only up to $S_{i-2}$ and carries no information about the state $S_{i-1}$ which determines the channel output.



[Figure 3: block diagram with message $M \in \{1 : 2^{nR}\}$, channel encoder $X_i(M, Z^{i-1})$, encoder feedback logic $A_{e,i}(M, Z^{i-1})$, finite state channel, decoder feedback logic $A_{d,i}(Y^i)$, feedback sampler $Z_i = f(A_{e,i}, A_{d,i}, Y_i)$, and channel decoder.]

Fig. 3. Modeling Feedback Sampling for the acquisition of feedback in Finite State Channels (FSCs) when the decoder can also use the current channel output to generate actions.



Theorem 9: Consider the system in Fig. 3. We have the following results paralleling those in Section III (for the setting of Fig. 1).

Let $s_0$ denote the initial state. We define $\underline{C}_{N,causal}(\Gamma)$ and $\overline{C}_{N,causal}(\Gamma)$ as follows (where causal indicates that decoder actions can also depend on the current channel output),

$$\underline{C}_{N,causal}(\Gamma) = \frac{1}{N} \max \min_{s_0} I(X^N \to Y^N \mid s_0) \tag{105}$$
$$\overline{C}_{N,causal}(\Gamma) = \frac{1}{N} \max \max_{s_0} I(X^N \to Y^N \mid s_0). \tag{106}$$

Here max denotes maximization over the joint probability distribution

$$P(s_0, x^N, a_e^N, \phi_d^N, y^N, z^N) = P(s_0) Q(x^N, a_e^N \| z^{N-1}) Q(\phi_d^N \| y^{N-1}) P(y^N \| x^N, s_0) \prod_{i=1}^{N} 1_{\{z_i = f(a_{e,i}, \phi_{d,i|y_i}, y_i)\}}, \tag{107}$$

such that $E[\Lambda(A_e^N, \Phi^N_{d|Y^N})] \le \Gamma$. Here $\phi_d^N$, $\phi^N_{d|y^N}$ are particular realizations of the random variables $\Phi_d^N$, $\Phi^N_{d|Y^N}$ and

$$\phi_{d,i|y_i} = f_{A_{d,i}}(y^{i-1}, y_i) \in \mathcal{A}_d \tag{108}$$
$$\phi_{d,i} = \{\phi_{d,i|y}\}_{y \in \mathcal{Y}} \in \mathcal{A}_d^{|\mathcal{Y}|} \tag{109}$$
$$\phi^N_{d|y^N} = \{\phi_{d,i|y_i}\}_{i=1}^{N}. \tag{110}$$

With slight abuse of notation, $\phi_{d,i|y}$ for each $y \in \mathcal{Y}$ denotes a function from $\mathcal{Y}^{i-1}$ to $\mathcal{A}_d$, and $\phi_{d,i}$ can be treated as a vector of functions $\{\phi_{d,i|y}\}_{y \in \mathcal{Y}}$. Note that $A_{d,i} = \phi_{d,i|y_i}$, and hence $\{\phi^N_{d|y^N}\}$ denotes the decoder action sequence.

1) Achievable Rate: For a communication abstraction as in Fig. 3, any rate $R$ is achievable such that

$$R < \lim_{N \to \infty} \underline{C}_{N,causal}(\Gamma) = \sup_N \left[ \underline{C}_{N,causal}(\Gamma) - \frac{\log|\mathcal{S}|}{N} \right]. \tag{111}$$

2) Converse: Consider a coding scheme with rate $R$ which achieves reliable communication over the FSC with feedback sampling as in Fig. 3. This implies the existence of $(N, \lceil 2^{NR} \rceil)$ codes such that the probability of error goes to zero as $N \to \infty$. For such a scheme, given $\epsilon > 0$, there exists a block length $N_0$ such that for all block lengths $N > N_0$ we have

$$R \le \overline{C}_{N,causal}(\Gamma) + \epsilon. \tag{112}$$

3) Capacity: In the following cases we characterize the capacity exactly,

a) For an FSC where the probability of the initial state is positive for all $s_0 \in \mathcal{S}$, the capacity is

$$C_{causal}(\Gamma) = \lim_{N \to \infty} \underline{C}_{N,causal}(\Gamma). \tag{113}$$

b) For stationary indecomposable channels without ISI and with feedback sampling as in Fig. 3, the capacity is

$$C_{causal}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to Y^N), \tag{114}$$

where the maximization is over the joint probability distribution

$$P(x^N, a_e^N, \phi_d^N, y^N, z^N) = Q(x^N, a_e^N \| z^{N-1}) Q(\phi_d^N \| y^{N-1}) P(y^N \| x^N) \prod_{i=1}^{N} 1_{\{z_i = f(a_{e,i}, \phi_{d,i|y_i}, y_i)\}},$$

such that $E[\Lambda(A_e^N, \Phi^N_{d|Y^N})] \le \Gamma$.

Proof: The proof is straightforward, as it uses results similar to those stated in Section III for the framework in Fig. 1, where decoder actions do not depend on the current channel output. The argument is as follows. Notice that the setting in Fig. 3, where the decoder takes actions $A_{d,i}(Y^i)$ and the sampling function is $Z_i = f(A_{e,i}, A_{d,i}, Y_i)$, is equivalent to the setting in Fig. 1 where the decoder takes actions $\tilde{A}_{d,i}(Y^{i-1}) \in \tilde{\mathcal{A}}_d = \mathcal{A}_d^{|\mathcal{Y}|}$, or $\tilde{A}_{d,i} = \Phi_{d,i}$ as defined in the Theorem above, and the feedback sampling function is

$$Z_i = g(A_{e,i}, \tilde{A}_{d,i}, Y_i) = f(A_{e,i}, \Phi_{d,i|Y_i}, Y_i). \tag{115}$$

More precisely, in operational terms, the generalized setting in which the decoder takes an action that also depends on the current output is equivalent to one in which the decoder takes an action vector, depending only on past channel outputs, with one entry for each possible value of the current output, and the feedback sampling function uses the current output to extract the corresponding action from this action vector to generate the sampled feedback. Hence all the above results are derived from those in Section III by the transformation $A_d \to \Phi_d$, $f(\cdot) \to g(\cdot)$. The cost constraints are hence equivalent to $E[\Lambda(A_e^N, \Phi^N_{d|Y^N})] \le \Gamma$. The idea is similar to that of Shannon strategies [45]. ■

Note 1: We started by solving a seemingly more restrictive case, i.e., the setting in Fig. 1, where decoder actions depend on the channel output strictly causally. In this section, we applied our results for the setting of Fig. 1 to characterize fundamental limits for the setting in Fig. 3, where decoder actions can also depend on the current channel output, by showing that the latter setting can be embedded in the former via an appropriate extension of the decoder action alphabet. Thus, the setting of Fig. 1 is in fact more general than that of Fig. 3. Interestingly, in the other direction, it does not appear that the results for the seemingly more restrictive setting of Fig. 1 can be deduced from those for the setting of Fig. 3.
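The embedding argument can be checked mechanically on a toy instance. In the sketch below, the binary alphabets, the sampling function `f`, and the decoder rule `causal_decoder_action` are hypothetical choices made only for illustration; the point is that the extended sampling function `g`, which indexes an action vector built from past outputs alone, reproduces the same feedback sample as a decoder that looks at the current output.

```python
from itertools import product

OUTPUTS = (0, 1)   # hypothetical output alphabet
ERASURE = None     # stands for "no feedback"

def f(a_e, a_d, y):
    """Hypothetical sampling function: feed y back only if both actions are 1."""
    return y if (a_e == 1 and a_d == 1) else ERASURE

def causal_decoder_action(y_past, y_now):
    """Hypothetical decoder rule that may look at the current output."""
    return 1 if (y_now == 1 or sum(y_past) % 2 == 1) else 0

def action_vector(y_past):
    """Extended action: one ordinary action per possible current output,
    computed from past outputs only."""
    return {y: causal_decoder_action(y_past, y) for y in OUTPUTS}

def g(a_e, phi, y):
    """Extended sampling function: select the entry of the action vector indexed by y."""
    return f(a_e, phi[y], y)

def samples_match(n):
    """Check that both descriptions give the same feedback sample on every
    output sequence of length n and every encoder action."""
    for ys in product(OUTPUTS, repeat=n):
        for a_e in (0, 1):
            for i in range(n):
                direct = f(a_e, causal_decoder_action(ys[:i], ys[i]), ys[i])
                embedded = g(a_e, action_vector(ys[:i]), ys[i])
                if direct != embedded:
                    return False
    return True
```

The price of the embedding is the larger action alphabet: each extended action is a tuple with one entry per output symbol.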
VIII. Special Cases

A. Feedback Logic At Encoder

The basic framework in this subsection is the setting in Fig. 4. Here only the encoder takes actions to govern feedback sampling.

Fig. 4. Modeling Feedback Sampling for the acquisition of feedback in Finite State Channels (FSCs) with only Encoder Feedback Logic.

Theorem 10: For a stationary and indecomposable FSC without ISI and with encoder feedback logic as in Fig. 4, the capacity is given by

$$C_{enc}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N, a^N \| z^{N-1}),\ E[\Lambda(A^N)] \le \Gamma} I(X^N \to Y^N). \tag{116}$$

Proof: Specialize Theorem 8 with $\mathcal{A}_d = \emptyset$, $\mathcal{A}_e = \mathcal{A}$ and $\Lambda(A_e, A_d) = \Lambda(A_e) = \Lambda(A)$. ■



B. Feedback Logic At Decoder 

In the previous subsection, we characterized the fundamental limit for the communication system depicted in Fig. 4, where the encoder takes actions to govern the acquisition of feedback from the decoder. However, in some practical systems it is the receiver (or decoder) which estimates the channel state perfectly and then decides whether to send it to the transmitter (or encoder) through noise-free feedback [46]. To model such a system, where sending noise-free feedback from receiver to transmitter is costly and is governed by actions taken by the decoder, we consider the communication abstraction in Fig. 5.

Theorem 11: For a stationary and indecomposable FSC without ISI and with decoder feedback logic as in Fig. 5, the capacity is given by

$$C_{dec}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1}) Q(a^N \| y^{N-1}),\ E[\Lambda(A^N)] \le \Gamma} I(X^N \to Y^N). \tag{117}$$

Fig. 5. Modeling Feedback Sampling for the acquisition of feedback in Finite State Channels (FSCs) with only Decoder Feedback Logic.

Proof: Specialize Theorem 8 with $\mathcal{A}_e = \emptyset$, $\mathcal{A}_d = \mathcal{A}$ and $\Lambda(A_e, A_d) = \Lambda(A_d) = \Lambda(A)$. ■

Note 2: If in Fig. 5 the decoder can use the current channel output to generate actions, we can apply the appropriate transformation in Theorem 9 to arrive at

$$C_{dec,causal}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{Q(x^N \| z^{N-1}) Q(\phi^N \| y^{N-1}),\ E[\Lambda(\Phi^N_{|Y^N})] \le \Gamma} I(X^N \to Y^N). \tag{118}$$
Fig. 6. Modeling logic 'to feed or not to feed back' in Finite State Channels (FSCs) with encoder taking actions.

IX. Numerical Example 1: To Feed or Not to Feed Back

A. Encoder Actions 

Consider the setting depicted in Fig. 6. Now the actions are binary, i.e., $\mathcal{A} = \{0, 1\}$. In this setting, the action sequence determines whether or not to feed back a deterministic function of the past channel output, i.e.,

$$Z_i = f(A_i, Y_i) = \begin{cases} g(Y_i), & \text{if } A_i = 1 \\ *, & \text{if } A_i = 0, \end{cases} \tag{119}$$

where $*$ stands for erasure, or no information about the feedback. As a specific example of such a setting, consider the communication system involving a Markovian channel as in Fig. 7, which is essentially a stationary, indecomposable FSC without ISI. Let the stationary distribution be given by $\pi(s)$, $\forall s \in \mathcal{S}$. The feedback from the decoder at time $i$ consists of the tuple $Y_{FB,i} = (Y_i, S_i)$, and the observed or sampled feedback is $Z_i = g(Y_{FB,i}) = S_i$ when $A_i = 1$. The cost function is $\Lambda(a) = a$, $a \in \mathcal{A}$, and the cost constraint is $\Gamma \in [0, 1]$. Using Theorem 10, the capacity of such a system is given by

$$C_{enc}(\Gamma) = \lim_{N \to \infty} C^N_{enc}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to (Y^N, S^N)). \tag{120}$$

In the following, we give a single-letter lower bound on $C_{enc}(\Gamma)$.

Theorem 12: The capacity of the system in Fig. 7 with encoder feedback logic is lower bounded as

$$C_{enc}(\Gamma) \ge C_{enc,lower}(\Gamma) = \max I(X; Y \mid S), \tag{121}$$

where the maximization is over the joint probability distribution

$$P_{S,A,Z,X,Y}(s, a, z, x, y) = \pi_S(s) P_A(a) 1_{\{z = f(a,s)\}} P_{X|Z,A}(x \mid z, a) P_{Y|X,S}(y \mid x, s), \tag{122}$$

and $E[\Lambda(A)] \le \Gamma$.



Proof: The joint distribution in the maximization in Eq. (120) is

$$P(s_0^N, a^N, z^N, x^N, y^N) = P(s_0) \prod_{i=1}^{N} Q(x_i, a_i \mid x^{i-1}, a^{i-1}, z^{i-1}) P(y_i \mid x_i, s_{i-1}) P(s_i \mid s_{i-1}) 1_{\{z_i = f(a_i, s_i)\}}. \tag{123}$$

To derive the lower bound we consider the following special type of the above distribution,

$$P'(s_0^N, a^N, z^N, x^N, y^N) = P(s_0) \prod_{i=1}^{N} Q(a_i) Q(x_i \mid z_{i-1}) P(y_i \mid x_i, s_{i-1}) P(s_i \mid s_{i-1}) 1_{\{z_i = f(a_i, s_i)\}}. \tag{124}$$

Note that the right hand side of the above distribution can be factorized as

$$P'(s_0^N, a^N, z^N, x^N, y^N) = \psi_i(s_0^{i-1}, a^{i-2}, z^{i-2}, x^{i-1}, y^{i-1})\, \phi_i(s_{i-1}, a_{i-1}^N, z_{i-1}^N, s_i^N, x_i^N, y_i^N), \tag{125}$$

which proves the Markov chain

$$(X_i, Y_i, S_i) - S_{i-1} - (X^{i-1}, Y^{i-1}, S_0^{i-2}). \tag{126}$$
Fig. 7. To feed or not to feed back when the encoder takes actions and the decoder knows the state. States are stationary and evolve as a Markov process.



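Hence we have

$$C^N_{enc}(\Gamma) \ge \frac{1}{N} \max I(X^N \to (Y^N, S^N)) \tag{127}$$
$$= \frac{1}{N} \max \sum_{i=1}^{N} I(X^i; Y_i, S_i \mid Y^{i-1}, S^{i-1}) \tag{128}$$
$$= \frac{1}{N} \max \sum_{i=1}^{N} H(Y_i, S_i \mid Y^{i-1}, S^{i-1}) - H(Y_i, S_i \mid X^i, Y^{i-1}, S^{i-1}) \tag{129}$$
$$\overset{(a)}{=} \frac{1}{N} \max \sum_{i=1}^{N} H(Y_i, S_i \mid S_{i-1}) - H(Y_i, S_i \mid X_i, S_{i-1}) \tag{130}$$
$$= \frac{1}{N} \max \sum_{i=1}^{N} I(X_i; Y_i, S_i \mid S_{i-1}) \tag{131}$$
$$\overset{(b)}{=} \frac{1}{N} \max \sum_{i=1}^{N} I(X_i; Y_i \mid S_{i-1}), \tag{132}$$

where

• (a) follows from the Markov chain (126) and from the channel model assumption [Eq. (10)].

• (b) follows from the chain rule

$$I(X_i; Y_i, S_i \mid S_{i-1}) = I(X_i; Y_i \mid S_{i-1}) + I(X_i; S_i \mid S_{i-1}, Y_i), \tag{134}$$

and from the fact that $I(X_i; S_i \mid S_{i-1}, Y_i) = 0$, due to the following state evolution,

$$P(S_i \mid S_{i-1}) = P(S_i \mid S_{i-1}, X_i, Y_i), \tag{135}$$

which follows from the assumption on the FSC, for example in Fig. 7, as

$$P(Y_i, S_i \mid X_i, S_{i-1}) = P(Y_i \mid X_i, S_{i-1}) P(S_i \mid S_{i-1}). \tag{136}$$

The maximum in the above inequalities is taken over the set of distributions

$$\mathcal{S}_1 = \Big\{ \prod_{i=1}^{N} Q(a_i) Q(x_i \mid z_{i-1}) : E[\Lambda(A^N)] \le \Gamma \Big\}. \tag{137}$$

Clearly

$$\mathcal{S}_2 = \Big\{ \prod_{i=1}^{N} Q(a_i) Q(x_i \mid z_{i-1}),\ E[\Lambda(A_i)] \le \Gamma,\ \forall i \Big\} \subset \mathcal{S}_1. \tag{138}$$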

Now, since the channel is stationary, it is invariant to time shifts, hence $P(s_i) = \pi(s_i)$, $\forall i$. Therefore we have the lower bound

$$C_{enc,lower}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{\prod_{i=1}^{N} Q(a_i) Q(x_i | z_{i-1}) :\ E[\Lambda(A_i)] \le \Gamma\ \forall i}\ \sum_{i=1}^{N} I(X_i; Y_i \mid S_{i-1}) \tag{139}$$
$$\overset{(c)}{\le} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \max_{Q(a_{i-1}) Q(x_i | z_{i-1}) :\ E[\Lambda(A_{i-1})] \le \Gamma} I(X_i; Y_i \mid S_{i-1}) \tag{140}$$
$$\overset{(d)}{=} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \max_{P(a_{i-1}, s_{i-1}, z_{i-1}, x_i, y_i)} I(X_i; Y_i \mid S_{i-1}), \tag{141}$$

where

• (c) follows from the identity $\max_x [f(x) + g(x)] \le \max_x f(x) + \max_x g(x)$.

• (d) has

$$P(a_{i-1}, s_{i-1}, z_{i-1}, x_i, y_i) = \pi(s_{i-1}) Q(a_{i-1}) 1_{\{z_{i-1} = f(a_{i-1}, s_{i-1})\}} Q(x_i \mid z_{i-1}) P(y_i \mid x_i, s_{i-1}), \tag{142}$$

such that $E[\Lambda(A_{i-1})] \le \Gamma$. $a_0$ is assumed to be a constant.

The inequality (c) holds with equality iff

$$P(a_{i-1}, s_{i-1}, z_{i-1}, x_i, y_i) = \arg\max_{P(a,s,z,x,y) :\ E[\Lambda(A)] \le \Gamma} I(X; Y \mid S) \quad \forall i, \tag{143}$$

where

$$P(a, s, z, x, y) = \pi(s) Q(a) 1_{\{z = f(a,s)\}} Q(x \mid z) P(y \mid x, s). \tag{144}$$

Note that for our setting $P(x \mid z, a) = P(x \mid z)$, as knowing $z$ determines $a$. Therefore, we have

$$C_{enc,lower}(\Gamma) = \max I(X; Y \mid S), \tag{145}$$

with maximization over the joint distribution

$$P_{S,A,Z,X,Y}(s, a, z, x, y) = \pi_S(s) P_A(a) 1_{\{z = f(a,s)\}} P_{X|Z,A}(x \mid z, a) P_{Y|X,S}(y \mid x, s), \tag{146}$$

such that $E[\Lambda(A)] \le \Gamma$. ■
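The single-letter bound $\max I(X; Y \mid S)$ can be evaluated by brute force for a small example. The sketch below is a hedged illustration: the binary state, the uniform stationary distribution `PI`, and the channel table `W` (a Z-channel in state 0 and an S-channel in state 1, both with parameter 0.5) are assumptions, since the parameters of Fig. 7 are not reproduced here. It uses the observation that $I(X; Y \mid S)$ depends on the strategy only through $P(x \mid s) = \Gamma\, P(x \mid z = s) + (1 - \Gamma)\, P(x \mid z = *)$ when $P(A = 1) = \Gamma$.

```python
import math

def h2(p):
    """Binary entropy in bits."""
    return 0.0 if p <= 0.0 or p >= 1.0 else -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def mutual_information(p1, w0, w1):
    """I(X;Y) in bits for binary X with P(X=1) = p1 and
    P(Y=1 | X=0) = w0, P(Y=1 | X=1) = w1."""
    py1 = (1 - p1) * w0 + p1 * w1
    return h2(py1) - (1 - p1) * h2(w0) - p1 * h2(w1)

# Hypothetical model (an assumption, not taken from Fig. 7):
PI = (0.5, 0.5)                 # stationary state distribution
W = ((0.0, 0.5), (0.5, 1.0))    # W[s] = (P(Y=1 | X=0, s), P(Y=1 | X=1, s))

def enc_lower_bound(gamma, steps=40):
    """Grid-search estimate of max I(X;Y|S) when the state is observed with
    probability gamma (action cost Lambda(a) = a, so E[Lambda(A)] = gamma)."""
    grid = [k / steps for k in range(steps + 1)]
    best = 0.0
    for q_blind in grid:  # P(X=1 | Z=*), used when the state is not observed
        value = 0.0
        for s in (0, 1):
            # For fixed q_blind the per-state terms decouple, so each
            # P(X=1 | Z=s) can be optimised separately.
            value += PI[s] * max(
                mutual_information(gamma * q_s + (1 - gamma) * q_blind, *W[s])
                for q_s in grid
            )
        best = max(best, value)
    return best

bound_no_feedback = enc_lower_bound(0.0)
bound_full_feedback = enc_lower_bound(1.0)
bound_half_cost = enc_lower_bound(0.5)
```

For this toy channel the bound at $\Gamma = 0.5$ already matches the bound at $\Gamma = 1$, so it lies strictly above the straight line joining the two end points; other channel parameters will generally give a different curve.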
Note 3: The lower bound on capacity at zero cost is

$$C_{enc,lower}(\Gamma = 0) = \max_{\pi_S P_X P_{Y|X,S}} I(X; Y \mid S). \tag{147}$$

This is indeed also the capacity at zero cost, as derived below. Clearly

$$C_{enc}(\Gamma = 0) \ge C_{enc,lower}(\Gamma = 0). \tag{148}$$

Now

$$C^N_{enc}(\Gamma = 0) \overset{(a)}{=} \frac{1}{N} \max I(X^N; Y^N, S^N) \tag{149}$$
$$\overset{(b)}{=} \frac{1}{N} \max I(X^N; Y^N \mid S^N) \tag{150}$$
$$= \frac{1}{N} \max \sum_{i=1}^{N} H(Y_i \mid Y^{i-1}, S^N) - H(Y_i \mid X^N, Y^{i-1}, S^N) \tag{151}$$
$$\overset{(c)}{\le} \frac{1}{N} \max \sum_{i=1}^{N} H(Y_i \mid S_{i-1}) - H(Y_i \mid X_i, S_{i-1}) \tag{152}$$
$$\overset{(d)}{\le} \frac{1}{N} \sum_{i=1}^{N} \max I(X_i; Y_i \mid S_{i-1}) \tag{153}$$
$$\overset{(e)}{=} \max_{\pi_S P_X P_{Y|X,S}} I(X; Y \mid S), \tag{154}$$

where

• (a) follows from the fact that $A_i = 0$ for all $1 \le i \le N$, and hence, since there is no feedback, mutual information is equal to directed information ( EU1 ).

• (b) follows from the fact that $X^N$ and $S^N$ are independent.

• (c) follows from the fact that conditioning reduces entropy and from the channel model assumption for the FSC, i.e., Eq. (10).

• (d) follows from the identity $\max_x [f(x) + g(x)] \le \max_x f(x) + \max_x g(x)$.

• (e) follows from the fact that the maximization in (e) is over the joint distribution

$$P(x^N, s_0^N, y^N) = P(x^N) P(s_0) \prod_{i=1}^{N} P(y_i, s_i \mid x_i, s_{i-1}). \tag{155}$$

Hence the joint distribution on $(X_i, Y_i, S_{i-1})$ is equivalent to

$$P(X_i, Y_i, S_{i-1}) = P(S_{i-1}) P(X_i) P(Y_i \mid X_i, S_{i-1}) = \pi(S_{i-1}) P(X_i) P(Y_i \mid X_i, S_{i-1}). \tag{156}$$

Hence, combining Eq. (148) and (154), we establish equality.

Note 4: Just as for the capacity at zero cost, we can show with similar steps that the lower bound on capacity at unit cost is tight too. Hence we have

$$C_{enc}(\Gamma = 1) = \max_{\pi_S P_{X|S} P_{Y|X,S}} I(X; Y \mid S). \tag{157}$$

Note that this scenario of complete feedback from a decoder with state information is similar to the scenario where both encoder and decoder know the states. The capacity of such a communication system was characterized in [47] for channels with memory, and indeed it coincides with Eq. (157).

Note 5: It is interesting to observe that the lower bound on capacity is the Probing Capacity of the system considered in [40] (as depicted in Fig. 8), where

• The channel is memoryless, with a state that is i.i.d. according to the stationary distribution $\pi_S$ of the channel with memory considered here.

• The encoder takes message-dependent binary actions that decide whether or not to observe the channel state.

• The decoder has complete state information.



B. Decoder Actions 

Theorem 13: Consider again the system in Fig. 7, but now with decoder feedback logic (instead of encoder feedback logic), with the decoder taking actions causally dependent on the channel output and state (for simplicity of notation we denote this capacity by $C_{dec}(\Gamma)$ instead of $C_{dec,causal}(\Gamma)$). The capacity of such a system is lower bounded as

$$C_{dec}(\Gamma) \ge C_{dec,lower}(\Gamma) = \max I(X; Y \mid S), \tag{158}$$

where the maximization is over the joint probability distribution

$$P_{S,A,Z,X,Y}(s, a, z, x, y) = \pi_S(s) P_{A|S}(a \mid s) 1_{\{z = f(a,s)\}} P_{X|Z,A}(x \mid z, a) P_{Y|X,S}(y \mid x, s), \tag{159}$$

such that $E[\Lambda(A)] \le \Gamma$.

Fig. 8. The Probing Capacity setting in [40], where the encoder takes message-dependent actions to observe the state and encodes using partial state information non-causally, while the decoder knows the complete channel state. Note that here the state is i.i.d.



Proof: The only difference between this lower bound and $C_{enc,lower}$ is that $P_{A|S}$ appears instead of $P_A$ in the distribution over which we maximize. The proof is similar to the proof of Theorem 12, except that the set of maximizing distributions taken here is

$$\mathcal{S} = \Big\{ \prod_{i=1}^{N} Q(a_i \mid s_i) Q(x_i \mid z_{i-1}),\ E[\Lambda(A_i)] \le \Gamma \Big\}. \tag{160}$$

Therefore,

$$C_{dec}(\Gamma) \ge C_{dec,lower}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{\mathcal{S}} I(X^N \to (Y^N, S^N)). \tag{161}$$

All the other steps follow as in the proof of Theorem 12. ■

We evaluate the lower bounds for the example in Fig. 7 when $\alpha = \beta = \epsilon = \delta = 0.5$. The region is shown in Fig. 9. From the plot it is clear that we can do much better than time sharing between the capacities at zero and unit cost when either the encoder or the decoder takes actions.




[Figure 9: plot of the lower bounds against the cost constraint (horizontal axis ticks from 0.1 to 0.9), together with a straight line labeled "TIME SHARING BOUND".]

Fig. 9. Cost-capacity trade-off for the example in Fig. 7. $C_{enc,lower}$ is the lower bound on capacity with encoder feedback logic. If, instead of the encoder, the decoder decides (causally dependent on channel output and state) when the encoder will sample feedback, $C_{dec,lower}$ is a lower bound on the capacity. The straight line represents the time sharing scheme, which is strictly sub-optimal.



X. Numerical Example 2 : Coding on the Backward Link in FSC 

Consider the setting depicted in Fig. 10. We allow coding on the backward link, i.e., the decoder encodes the channel outputs causally ($A_i(Y^i) \in \mathcal{A}$) and sends the result to the encoder. The encoder uses the acquired active feedback symbols to generate channel input symbols, i.e., $X_i(M, A^{i-1})$. For stationary indecomposable FSCs with active feedback we denote the capacity by $C_{AF}$. The setting in Fig. 10 is a very special case of the framework of decoder feedback logic considered in Note 2 at the end of Section VIII-B when we let $f(A_{d,i}, Y_i) = A_{d,i} = A_i$. Hence, using the above conditions in the capacity expression mentioned in Note 2 after Theorem 11, we have

$$C_{AF} = \lim_{N \to \infty} \frac{1}{N} \max I(X^N \to Y^N), \tag{162}$$

where the maximization is over the joint distribution

$$P(x^N, a^N, \phi_d^N, y^N) = Q(x^N \| a^{N-1}) Q(\phi_d^N \| y^{N-1}) P(y^N \| x^N) \prod_{i=1}^{N} 1_{\{a_i = \phi_{d,i|y_i}\}}, \tag{163}$$

such that $E[\Lambda(A^N)] = E[\Lambda(\Phi^N_{d|Y^N})] \le \Gamma$, where $\Phi^N_{d|Y^N}$, $\phi_d^N$ are as defined in Section VII.

Now consider an example under this setting where the channel evolution is Markovian with binary states, i.e.,

$$P(Y_i, S_i \mid X_i, S_{i-1}) = P(Y_i \mid X_i, S_{i-1}) P(S_i \mid S_{i-1}), \tag{164}$$

and states are known to the decoder on the fly. Hence, the decoder performs coding on the backward link as $A_i = A_i(Y^i, S^i) \in \mathcal{A}$. The Markov chain is assumed to be stationary with distribution $\pi_S$, and states take values in a finite alphabet $\mathcal{S}$. We consider the following special cases when $|\mathcal{A}| \ge |\mathcal{S}|$:

Fig. 10. Modeling coding on the backward link in finite state channels (FSCs).



A. No Cost Constraints 

Theorem 14: Under this setting, the capacity is given by

$$C_{AF} = \max_{\pi_S P_{X|S} P_{Y|X,S}} I(X; Y \mid S). \tag{165}$$

Proof: Since the decoder knows the states, the effective output is the tuple $Y_{FB,i} = (Y_i, S_i)$. Achievability is straightforward. Actions simply communicate the state, $A_i = S_i$. Hence in this case the setup is the same as the encoder and decoder both knowing the states, and by the notes at the end of Section VIII-A we have

$$C_{AF} \ge \max_{\pi_S P_{X|S} P_{Y|X,S}} I(X; Y \mid S). \tag{166}$$

Now consider, for the converse,

$$C_{AF} = \lim_{N \to \infty} \max \frac{1}{N} I(X^N \to (Y^N, S^N)) \tag{167}$$
$$= \lim_{N \to \infty} \max \frac{1}{N} \sum_{i=1}^{N} H(Y_i, S_i \mid Y^{i-1}, S^{i-1}) - H(Y_i, S_i \mid X^i, Y^{i-1}, S^{i-1}) \tag{168}$$
$$\le \lim_{N \to \infty} \max \frac{1}{N} \sum_{i=1}^{N} H(Y_i, S_i \mid S_{i-1}) - H(Y_i, S_i \mid X_i, S_{i-1}) \tag{169}$$
$$= \lim_{N \to \infty} \max \frac{1}{N} \sum_{i=1}^{N} I(X_i; Y_i, S_i \mid S_{i-1}) \tag{170}$$
$$\overset{(a)}{=} \lim_{N \to \infty} \max \frac{1}{N} \sum_{i=1}^{N} I(X_i; Y_i \mid S_{i-1}) \tag{171}$$
$$\overset{(b)}{\le} \lim_{N \to \infty} \frac{1}{N} \sum_{i=1}^{N} \max I(X_i; Y_i \mid S_{i-1}), \tag{172}$$

where (a) follows from arguments similar to those used for Eq. (132), while (b) follows from the identity $\max_x [f(x) + g(x)] \le \max_x f(x) + \max_x g(x)$. Note that the maximization in (b) is over the joint probability distribution

$$P(s_0, x^N, a^N, \phi_d^N, y^N) = P(s_0) Q(x^N \| a^{N-1}) Q(\phi_d^N \| y^{N-1}) P(y^N \| x^N, s_0) \prod_{i=1}^{N} 1_{\{a_i = \phi_{d,i|y_i}\}}, \tag{173}$$

but one can average out $(a^N, \phi_d^N, s^N \setminus s_{i-1}, x^N \setminus x_i, y^N \setminus y_i)$, so that for the $i$-th term the maximization is over the joint probability distribution

$$P(s_{i-1}, x_i, y_i) = \pi_S(s_{i-1}) P(x_i \mid s_{i-1}) P(y_i \mid x_i, s_{i-1}), \tag{174}$$

since the states are stationary and distributed as $\pi_S$. Hence we have

$$C_{AF} \le \max_{\pi_S P_{X|S} P_{Y|X,S}} I(X; Y \mid S). \tag{175}$$

The proof is completed using Eq. (166) and (175). ■



B. Cost constraint $\Gamma$

The conditions of this subsection differ from those of the previous one only in the cost constraints. The intuition here is to look for an achievability scheme which decides, depending on the cost constraints, when to send state information from decoder to encoder.

Fig. 11. Modeling coding on the backward link for the Markovian channel with binary states.

Theorem 15: For the system in Fig. 11, the capacity is lower bounded as

$$C_{AF}(\Gamma) \ge C_{AF,lower}(\Gamma) = \max I(X; Y \mid S), \tag{176}$$

where the maximization is over the joint probability distribution

$$P_{S,A,X,Y}(s, a, x, y) = \pi_S(s) P_{A|S}(a \mid s) P_{X|A}(x \mid a) P_{Y|X,S}(y \mid x, s), \tag{177}$$

where $E[\Lambda(A)] \le \Gamma$.

Proof: We only sketch the proof, as it is similar to the proof of Theorem 12 except that the set of maximizing distributions taken here is

$$\mathcal{S} = \Big\{ \prod_{i=1}^{N} Q(a_i \mid s_i) Q(x_i \mid a_{i-1}),\ E[\Lambda(A_i)] \le \Gamma \Big\}. \tag{178}$$

Therefore,

$$C_{AF}(\Gamma) \ge C_{AF,lower}(\Gamma) = \lim_{N \to \infty} \frac{1}{N} \max_{\mathcal{S}} I(X^N \to (Y^N, S^N)). \tag{179}$$

All the other steps follow as in the proof of Theorem 12. ■

We consider an example under this setting, as depicted in Fig. 11. We assume $\mathcal{A}$ is a binary alphabet with cost function $\Lambda(a) = a$, $a \in \{0, 1\}$; this models the scenario of cost-constrained one-bit active feedback in the given finite state channel. The plot for $\alpha = \beta = \delta = 0.5$ is shown in Fig. 12. Note that this bound is equal to the bound $C_{dec,lower}$ in Theorem 13 for $f(a, s) = a$.

Fig. 12. Cost-capacity trade-off for the example in Fig. 11. $C_{AF,lower}$ is the lower bound on capacity. The dashed straight line represents a naive time sharing scheme. It is seen that not only is time sharing suboptimal, but that the feedback capacity can be achieved in full even when observing only a small fraction of the symbols fed back.



XI. Blahut-Arimoto Algorithm for Action Dependent Feedback

In this section we develop a numerical algorithm, as an extension of the Blahut-Arimoto algorithm, for action dependent feedback (BAA-Action), when the encoder takes the action that determines the quality and availability of the feedback from the decoder. Blahut [41] and Arimoto [42] suggested an algorithm based on alternating maximization to compute the mutual information, and this was extended to computing directed information in [43]. Our approach is similar to the latter, with the difference that in our case the acquisition of the feedback is determined by a cost-constrained action. Our algorithm also works for the case when there is a joint cost constraint on both action and channel input symbols, thereby generalizing the result in [43] to compute directed information with cost constraints on channel input symbols.



A. Algorithm : BAA-Action 

We formally state the algorithm, which will be used later to give a series of computable upper and lower bounds. The goal is to maximize the normalized directed information

$$\frac{1}{N} I(X^N \to Y^N) = \frac{1}{N} \sum_{x^N, a^N, y^N} p(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{p(y^N \| x^N)}{p(y^N)}, \tag{180}$$

where the maximum is over joint probability distributions $p(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)\, 1_{\{z^N = f(a^N, y^N)\}}$ such that $E[\Lambda(A^N)] \le \Gamma$. We will henceforth refrain from explicitly writing $1_{\{z^N = f(a^N, y^N)\}}$, as it is clear from the context how $z^N$ is a deterministic function of $y^N$ and $a^N$. Let us denote $p(x^N, a^N \| z^{N-1})$ by $r(x^N, a^N \| z^{N-1})$. Note further that the maximization is effectively over $r(\cdot)$, as $p(y^N \| x^N)$ is a property of the channel and $r(\cdot)$ determines whether the cost constraints are satisfied or not. Note that

$$p(x^N, a^N, y^N) = r(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \tag{181}$$
$$= p(y^N)\, q(x^N, a^N \mid y^N). \tag{182}$$

Thus the directed information can equivalently be written as

$$I(X^N \to Y^N) = \sum_{x^N, a^N, y^N} r(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{q(x^N, a^N \mid y^N)}{r(x^N, a^N \| z^{N-1})} \tag{183}$$
$$\triangleq I(r, q), \tag{184}$$

where the shorthand notations are defined as $r = r(x^N, a^N \| z^{N-1})$ and $q = q(x^N, a^N \mid y^N)$.

Our algorithm (presented as Algorithm 1) computes the normalized directed information (Eq. (184)) by using the Lagrangian approach (outlined in Appendix D). The Lagrangian multiplier, $\lambda$, corresponds to the tradeoff between cost and the normalized directed information. Hence we define, for $\lambda \ge 0$, $I(r, q, \lambda) = I(r, q) - \lambda E[\Lambda(A^N)]$. Denote $C_N^{(\lambda)} = \max I(r, q, \lambda)$, and let $\Gamma^{(\lambda)}$ be the corresponding cost incurred at the maximizer of $C_N^{(\lambda)}$. The evaluations of $C_N^{(\lambda)}$ and $\Gamma^{(\lambda)}$, as described in Algorithm 1, together characterize this tradeoff, and the tradeoff curve is obtained by appropriately sweeping through the values of $\lambda$.
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Algorithm 1 BAA- Action : Block length N and the channel p(y \\ x ) is given. The Lagrangian multiplier A is fixed 



Initialize q{x N " NL - N 
for i = N -> 1 do 



a" \y N ). For eg. initializing with uniform distribution, i.e., q(x N , a N \y N ) 



<- \X 



-N 



r'ix^a 1 ,^- 1 )^^ 



q(x w ,a N \y N )2- NXA ^ ^ 



n j= i+i r(xj,aj\xi- 



,zi-i) 



r(xi,di\x l 1 ,a l 1 ,z l x ) <— ^ a r'ix',^^- 1 ) 

where A ifZ = {y^ 1 : /(a^ 1 ,^" 1 ) = z^ 1 } 
end for 

Compute r(x N ,a N \\ z N - 1 ) <- H? =1 r(x u a^x^ 1 , a 4 " 1 , 

w a ivr(x JV ,a J, '||a^- 1 )p(i,f|| a ;JV) 



Compute g(x JV ,a JV |j/ JV ) <- ^ 



Calculate % — /j, where, 

, TV Ni JV \ 

^ <- ^E^a^K^,^ I! II $$>\ y 

Iu ^ jf max Xuai J2 yi max X2iQ2 • • • max If( ^ ^j, N II ^ lo g ]r 



if /(j — Jf, > e then Goto the loop again, 
end if 



Compute C\f' <- Iu- 

Compute <- E^w^Ka^a" || z N ~ 1 )p{y N \\ x N )A(A N ). 



The algorithm takes as input a particular block length $N$, the channel $p(y^N \| x^N)$ and the Lagrangian multiplier $\lambda$. To begin
with, $q(\cdot)$ is initialized with the uniform distribution as shown in Algorithm 1. This $q(\cdot)$ is then used to update $r(\cdot)$, which
is in turn used to update $q(\cdot)$. The update rule is chosen so that the directed information is maximized; hence this is an alternating
maximization procedure, as in [41], [42], [43]. The lower and upper bounds converge to $C_N^{\lambda}$ as the number of iterations increases.
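As a point of reference for the alternating maximization described above, the following is a minimal sketch of the classical Blahut-Arimoto iteration with an input cost, for a memoryless channel with block length 1 and no feedback or actions. It is not BAA-Action itself: the function name `blahut_arimoto_cost`, the argument layout, and the per-input cost vector (standing in for $\Lambda$) are assumptions made for illustration only.

```python
import math

def blahut_arimoto_cost(channel, cost, lam, iters=300):
    """Alternating maximization of I(r, q) - lam * E[cost] (in bits).

    channel[x][y] = p(y|x); cost[x] is a hypothetical per-input cost.
    Returns (mutual information, expected cost, input distribution r).
    """
    nx, ny = len(channel), len(channel[0])
    r = [1.0 / nx] * nx  # uniform start keeps every r(x) positive
    for _ in range(iters):
        # q-step: q(x|y) = r(x) p(y|x) / sum_x' r(x') p(y|x').
        py = [sum(r[x] * channel[x][y] for x in range(nx)) for y in range(ny)]
        # r-step: r(x) proportional to 2^(-lam*cost(x)) * prod_y q(x|y)^p(y|x).
        weights = []
        for x in range(nx):
            exponent = -lam * cost[x]
            for y in range(ny):
                if channel[x][y] > 0:
                    q_xy = r[x] * channel[x][y] / py[y]
                    exponent += channel[x][y] * math.log2(q_xy)
            weights.append(2.0 ** exponent)
        total = sum(weights)
        r = [w / total for w in weights]
    py = [sum(r[x] * channel[x][y] for x in range(nx)) for y in range(ny)]
    info = sum(r[x] * channel[x][y] * math.log2(channel[x][y] / py[y])
               for x in range(nx) for y in range(ny) if channel[x][y] > 0)
    expected_cost = sum(r[x] * cost[x] for x in range(nx))
    return info, expected_cost, r

# Sweeping lam traces a cost-rate tradeoff for a binary symmetric channel.
bsc = [[0.9, 0.1], [0.1, 0.9]]
for lam in (0.0, 0.5, 2.0):
    info, c, _ = blahut_arimoto_cost(bsc, [0.0, 1.0], lam)
    print(lam, round(info, 4), round(c, 4))
```

Each value of `lam` yields one (rate, cost) pair, which mirrors how the tradeoff curve is obtained by sweeping the Lagrangian multiplier.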



B. Numerical Evaluation 

Here we propose and evaluate upper and lower bounds for the example described in Fig. 7, where states are generated
through a Markovian channel and only the encoder takes actions to decide whether the states, which are known to the decoder,
will be fed back to the encoder or not. We will also contrast the bounds with the analytical lower bound $C_{\text{enc,lower}}(\Gamma)$ obtained
in Section IX. Our investigation in this section yields an upper bound on the capacity which, along with the analytical
lower bound $C_{\text{enc,lower}}(\Gamma)$, provides tight and computable bounds for the capacity.

From Section IX, the capacity of this channel (which is finite indecomposable) is

$$\lim_{N \to \infty} \frac{1}{N} \max_{r:\, E[\Lambda(A^N)] \le \Gamma} I(X^N \to Y^N, S^N) = \lim_{N \to \infty} \frac{1}{N} \max_{r:\, E[\Lambda(A^N)] \le \Gamma}\ \min_{s \in \mathcal{S}} I(X^N \to Y^N, S^N \mid S_0 = s) \tag{185}$$

$$\stackrel{(a)}{=} \lim_{N \to \infty} \frac{1}{N} \max_{r:\, E[\Lambda(A^N)] \le \Gamma} I(X^N \to Y^N, S^N \mid S_0 = 0) \tag{186}$$

where (a) follows from the structure of the channel: the Z and S channels induce similar joint distributions
due to the symmetry in their structure (one channel can be obtained from the other by replacing 1 with 0 and 0 with 1), so we have

$\max_{r(x^N, a^N \| z^{N-1}, S_0 = 0)} I(X^N \to Y^N \mid S_0 = 0) = \max_{r(x^N, a^N \| z^{N-1}, S_0 = 1)} I(X^N \to Y^N \mid S_0 = 1)$. Define,



$$C_N(\Gamma) = \max \frac{1}{N} I(X^N \to Y^N, S^N \mid S_0 = 0) \tag{187}$$

where the maximum is over joint probability distributions $r(x^N, a^N \| z^{N-1})\, p(y^N \| x^N, s_0 = 0)\, \mathbf{1}_{\{z^N = f(a^N, y^N)\}}$ such that
$E[\Lambda(A^N)] \le \Gamma$. We now have the following theorem (the proof is deferred to Appendix E).

Theorem 16: For the channel in Fig. 7, there exist computable bounds for $C(\Gamma)$, defined for $N \ge 1$, where the lower bound
is $C_N(\Gamma - \frac{1}{N}) \le C(\Gamma)$ for $\Gamma \in [\frac{1}{N}, 1]$ and the upper bound is $C(\Gamma) \le C_N(\Gamma)$ for $\Gamma \in [0, 1]$.

Note that the bounds are tight and converge to the capacity as $N \to \infty$. The computation is performed using the algorithm
outlined above. For the case of $\alpha = \beta = \delta = \epsilon = 0.5$, the bounds computed using the algorithm are shown in Fig. 13, where
computation is performed for $N = 2$ and $N = 3$. Fig. 14 contrasts the upper bound for $N = 3$ with the analytical lower
bound $C_{\text{enc,lower}}(\Gamma)$. Note from the graph that the benefit of using the full feedback is obtained around $\Gamma \approx 0.2034$. The code is
available at



[Figure: Bounds on capacity $C(\Gamma)$ computed using BAA-Action. Curves: upper and lower bounds for $N = 2$ and $N = 3$.]

Fig. 13. Cost-capacity tradeoff for the example in Fig. 7. Bounds correspond to our calculation from the algorithm for $N = 2$ and $N = 3$.



XII. Conclusion 

In this paper, we studied communication systems with finite state channels (FSCs), where the encoder and decoder adaptively
decide what to feed back from the decoder to the encoder to optimize the rate of reliable communication, under an average
cost constraint. For FSCs where the probability of the initial state is positive for all states, or for stationary indecomposable FSCs, we
have an exact characterization of the capacity. We also discussed the special case of 'to feed or not to feed back', where either
the encoder or the decoder takes binary actions that determine whether or not a deterministic function of the channel output will
be fed back to the encoder. As another special case, we characterized the capacity in the case of coding on the backward link for
FSCs. In the case of Markovian channels, we showed by explicit computation that naive time-sharing schemes can be highly
suboptimal. Finally, we proposed a Blahut-Arimoto type algorithm based on alternating maximization to give computable upper
and lower bounds for a class of Markovian channels.

Appendix A

Some Properties of Causal Conditioning and Directed Information

Here we present some of the basic properties of causal conditioning and directed information. The proofs are omitted as
they are similar to those of the corresponding lemmas in [9].

• Property 1. [Chain rule for causal conditioning]

$$P(x^N, a^N, y^N) = P(y^N \| x^N, a^N)\, P(x^N, a^N \| y^{N-1}). \tag{188}$$

Similarly,

$$P(x^N, a^N, y^N \mid s_0) = P(y^N \| x^N, a^N, s_0)\, P(x^N, a^N \| y^{N-1}, s_0). \tag{189}$$

• Property 2:

$P(x^N, a_e^N \| z^{N-1})$ uniquely determines $P(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1})$ for all $1 \le i \le N$ and all the arguments
$(x^{i-1}, a_e^{i-1}, z^{i-1})$ for which $P(x^{i-1}, a_e^{i-1}, z^{i-1}) > 0$. A similar result holds for $P(a_d^N \| y^{N-1})$.

• Property 3:

$$|I(X^N \to Y^N) - I(X^N \to Y^N \mid S)| \le H(S) \le \log |\mathcal{S}|.$$
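The chain rule in Property 1 can be checked numerically on a toy case. The sketch below is an illustration under simplifying assumptions: block length $N = 2$, binary alphabets, no action variables, and a randomly drawn joint distribution. The helper name `causal_factors` is hypothetical.

```python
import itertools
import random

def causal_factors(joint):
    """joint[(x1, y1, x2, y2)] -> (P(y^2 || x^2), P(x^2 || y^1)), keyed like joint."""
    def marginal(keep):
        out = {}
        for key, prob in joint.items():
            sub = tuple(key[i] for i in keep)
            out[sub] = out.get(sub, 0.0) + prob
        return out

    p_x1 = marginal((0,))
    p_x1y1 = marginal((0, 1))
    p_x1y1x2 = marginal((0, 1, 2))
    py_x, px_y = {}, {}
    for key, prob in joint.items():
        x1, y1, x2, y2 = key
        # P(y^2 || x^2) = P(y1 | x1) * P(y2 | x1, x2, y1)
        py_x[key] = (p_x1y1[(x1, y1)] / p_x1[(x1,)]) * (prob / p_x1y1x2[(x1, y1, x2)])
        # P(x^2 || y^1) = P(x1) * P(x2 | x1, y1)
        px_y[key] = p_x1[(x1,)] * (p_x1y1x2[(x1, y1, x2)] / p_x1y1[(x1, y1)])
    return py_x, px_y

random.seed(0)
keys = list(itertools.product((0, 1), repeat=4))
raw = [random.random() + 0.1 for _ in keys]  # strictly positive weights
joint = {k: w / sum(raw) for k, w in zip(keys, raw)}
py_x, px_y = causal_factors(joint)
max_gap = max(abs(py_x[k] * px_y[k] - joint[k]) for k in keys)
print(max_gap)
```

The product of the two causally conditioned factors reproduces the joint distribution up to floating-point rounding, which is the content of Eq. (188) without the action variables.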



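Appendix B
Proof of Theorem 2

From our achievability scheme and Lemma 1 we have

$$P(x^N, a_e^N, a_d^N, y^N) = \sum_{s_0} P(s_0, x^N, a_e^N, a_d^N, y^N) \tag{190}$$
$$= Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}) \sum_{s_0} P(s_0)\, P(y^N \| x^N, s_0) \tag{191}$$
$$= Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N), \tag{192}$$

where the last equality follows from Eq. (23). Hence we have

$$E(P_{e,m}) = \sum_{y^N} \sum_{x^N, a_e^N, a_d^N} P(x^N, a_e^N, a_d^N, y^N)\, P(\text{error} \mid m, x^N, a_e^N, a_d^N, y^N) \tag{193}$$
$$= \sum_{y^N} \sum_{x^N, a_e^N, a_d^N} Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N)\, P(\text{error} \mid m, x^N, a_e^N, a_d^N, y^N). \tag{194}$$

Let $A_{m'} = \{\text{event such that } P(y^N \mid m') \ge P(y^N \mid m) \text{ for } m' \ne m\}$. Alternatively, if for a message $m$ the encoder generated $x^N$,
$A_{m'} = \{\text{event such that } P(y^N \| x'^N) \ge P(y^N \| x^N) \text{ for } x'^N \ne x^N\}$. Then

$$P(A_{m'} \mid m, x^N, a_e^N, a_d^N, y^N) = \sum_{x'^N, a_e'^N, a_d'^N} Q(x'^N, a_e'^N \| z'^{N-1})\, Q(a_d'^N \| y^{N-1})\, \mathbf{1}_{\{P(y^N \| x'^N) \ge P(y^N \| x^N)\}} \tag{195}$$
$$\le \sum_{x'^N, a_e'^N, a_d'^N} Q(x'^N, a_e'^N \| z'^{N-1})\, Q(a_d'^N \| y^{N-1}) \left[ \frac{P(y^N \| x'^N)}{P(y^N \| x^N)} \right]^{s}, \quad \text{for any } s > 0. \tag{196}$$

Hence,

$$P(\text{error} \mid m, x^N, a_e^N, a_d^N, y^N) = P\Big(\bigcup_{m' \ne m} A_{m'} \,\Big|\, m, x^N, a_e^N, a_d^N, y^N\Big) \tag{197}$$
$$\le \min\Big\{ \sum_{m' \ne m} P(A_{m'} \mid m, x^N, a_e^N, a_d^N, y^N),\ 1 \Big\} \tag{198}$$
$$\le \Big[ \sum_{m' \ne m} P(A_{m'} \mid m, x^N, a_e^N, a_d^N, y^N) \Big]^{\rho}, \quad \text{for any } 0 \le \rho \le 1 \tag{199}$$
$$\le \Big[ (M - 1) \sum_{x'^N, a_e'^N, a_d'^N} Q(x'^N, a_e'^N \| z'^{N-1})\, Q(a_d'^N \| y^{N-1}) \Big( \frac{P(y^N \| x'^N)}{P(y^N \| x^N)} \Big)^{s} \Big]^{\rho}. \tag{200}$$

Now substituting (200) in (194) and using $s = \frac{1}{1+\rho}$ we obtain,

$$E[P_{e,m}] \le (M - 1)^{\rho} \sum_{y^N} \Big[ \sum_{x^N, a_e^N, a_d^N} Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N)^{\frac{1}{1+\rho}} \Big]^{1+\rho}.$$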



Appendix C
Proof of Some Markov Chains

If a given scheme satisfies the joint probability distribution as in Eq. (13), we have the following Markov chains:

MC1: $Y_i - (X^i, Y^{i-1}, S_0) - (M, A_e^i, A_d^i)$.

MC2: $(X_i, A_{e,i}) - (X^{i-1}, A_e^{i-1}, Z^{i-1}) - (Y^{i-1}, A_d^{i-1}, S_0)$.

MC3: $A_{d,i} - (Y^{i-1}, A_d^{i-1}) - (X^i, A_e^i, Z^{i-1}, S_0)$.
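To prove MC1, consider again the joint probability distribution induced by a given scheme,

$$P_{M, A_e^N, A_d^N, Z^N, X^N, S_0^N, Y^N, \hat{M}}(m, a_e^N, a_d^N, z^N, x^N, s_0^N, y^N, \hat{m})$$
$$= \frac{1}{|\mathcal{M}|} P_S(s_0) \prod_{i=1}^{N} \mathbf{1}_{\{a_{d,i} = f_{A_d,i}(y^{i-1})\}}\, \mathbf{1}_{\{a_{e,i} = f_{A_e,i}(m, z^{i-1})\}}$$
$$\times\ \mathbf{1}_{\{x_i = f_{e,i}(m, z^{i-1})\}}\, P(y_i, s_i \mid x_i, s_{i-1})\, \mathbf{1}_{\{z_i = f(a_{e,i}, a_{d,i}, y_i)\}} \times \mathbf{1}_{\{\hat{m} = f_d(y^N)\}}. \tag{202}$$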



Summing over (M, , S^, Fj+i> A^ i+V A^ i+1 ) in Eq. (Q we obtain, 

= TmI II 1 {^, J =/A diJ (^- 1 '} 1 {a e , 3 =/A c>3 (m,^-i)}l{ a:3 =/ eiJ (m,^-i)} X J[ 1 



{zi=/(oe,i,ad,ii3/»)} 



= $!(M, A l e , Ajj, Z'" 1 , Y*-\ X 4 )$ 2 (T\ S l Q ,X l ) 

= $' 1 (M,4,4 ! z i - 1 ,y i - 1 ,x i )$ 2 (y i ,5j,x i ), 

which implies the Markov chain $(M, A_e^i, A_d^i, Z^{i-1}) - (Y^{i-1}, X^i, S_0) - (Y_i, S^i)$, which implies MC1.
Lemma 3: If MC1 holds,

$$P(y^N \| x^N, a_e^N, a_d^N, s_0) = P(y^N \| x^N, s_0).$$

Proof: This follows by the chain rule in expanding $P(y^N \| x^N, a_e^N, a_d^N, s_0)$,

$$P(y^N \| x^N, a_e^N, a_d^N, s_0) = \prod_{i=1}^{N} P(y_i \mid y^{i-1}, x^i, a_e^i, a_d^i, s_0)$$
$$\stackrel{(*)}{=} \prod_{i=1}^{N} P(y_i \mid y^{i-1}, x^i, s_0)$$
$$= P(y^N \| x^N, s_0),$$

where (*) follows from MC1.

To prove MC2, again summing over $(\hat{M}, X_{i+1}^N, Z_i^N, S_i^N, Y_i^N, A_{e,i+1}^N, A_{d,i}^N)$ in Eq. (202), we obtain,



F > M,Ai,Ai~ 1 ,Z ,i -' L ,X i ,Sn~ 1 ,Y i - 1 ( m > a e' a d ' ^ X -i x% i s \ i V % (211) 
1 

\M\ 



1 

T7T 1 {^=/e,i(m,z i - I )} 1 {a e ,i=/^. < (m, Z *- 1 )} II 1 {a e , j =f Ae , (m^- 1 )} 1 {xf=/ e , i (m,^- 1 )} 



x II 1 {««=/^j(^- i n- P (% , ' s jl a: i' s J-i) 1 {^=/(«a,i,a <J , < , 3 / < )} x PsOo) (212) 

= ^(M^i.^i.^-S^r 1 .^*" 1 )^^- 1 ,^ 1 ,^- 1 ^*- 1 ,^- 1 ,^- 1 ), (213) 

which implies the Markov chain $(M, X_i, A_{e,i}) - (X^{i-1}, A_e^{i-1}, Z^{i-1}) - (Y^{i-1}, S_0^{i-1}, A_d^{i-1})$, which implies MC2.
To prove MC3, we sum over $(\hat{M}, X_{i+1}^N, Z_i^N, S_i^N, Y_i^N, A_{e,i+1}^N, A_{d,i+1}^N)$ in Eq. (202) and obtain,

^ > M,Ai,A^,Z i ~ 1 ,X i ,Si~ 1 ,Y i ~ 1 a ^ Z ' ^ ' S '2/ ) 

P§(s ) 

= [At] 1 { a: »=/=.'( m . zl " 1 ). a <=,i=/^< ! , l ( m . zi " 1 )} II 1 {°e, 3 =/A C:J (m :2: 3- 1 ) : x j =/ c , J (m,^- 1 ), Zi =/(a e , l ,a £i , I ,y,)}' P (%' S il X J! S J-l) 



i-1 



St 1 ,**, K, Z i - 1 ,At 1 ,Y i - 1 )9 2 {At 1 ,Y i - 1 ,A dli ), (215) 



which implies the Markov chain $A_{d,i} - (Y^{i-1}, A_d^{i-1}) - (X^i, A_e^i, Z^{i-1}, S_0^{i-1}, M)$, which implies MC3.
Lemma 4: If MC2 and MC3 hold, then

$$Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0) = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}). \tag{216}$$

Proof: Applying the chain rule for causal conditioning,

$$Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0) = \prod_{i=1}^{N} Q(x_i, a_{e,i}, a_{d,i} \mid x^{i-1}, a_e^{i-1}, a_d^{i-1}, y^{i-1}, s_0) \tag{217}$$
$$= \prod_{i=1}^{N} Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, a_d^{i-1}, y^{i-1}, s_0)\, Q(a_{d,i} \mid x^i, a_e^i, a_d^{i-1}, y^{i-1}, s_0) \tag{218}$$
$$\stackrel{(*)}{=} \prod_{i=1}^{N} Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, a_d^{i-1}, y^{i-1}, z^{i-1}, s_0)\, Q(a_{d,i} \mid x^i, a_e^i, a_d^{i-1}, y^{i-1}, z^{i-1}, s_0) \tag{219}$$
$$\stackrel{(**)}{=} \prod_{i=1}^{N} Q(x_i, a_{e,i} \mid x^{i-1}, a_e^{i-1}, z^{i-1})\, Q(a_{d,i} \mid a_d^{i-1}, y^{i-1}) \tag{220}$$
$$= Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), \tag{221}$$

where (*) follows from the fact that $z_i = f(a_{e,i}, a_{d,i}, y_i)$, while (**) follows from MC2 and MC3. ■
Lemma 5: If a given scheme satisfies the joint distribution as in Eq. (13), then

$$P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0) \tag{222}$$
$$P(x^N, a_e^N, a_d^N, y^N) = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N). \tag{223}$$

Proof: Using Property 1 in Appendix A we have,

$$P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0)\, P(y^N \| x^N, a_e^N, a_d^N, s_0), \tag{225}$$

but we have already proved that MC1, MC2 and MC3 hold, which implies by Lemmas 3 and 4 that

$$P(y^N \| x^N, a_e^N, a_d^N, s_0) = P(y^N \| x^N, s_0) \tag{226}$$
$$Q(x^N, a_e^N, a_d^N \| y^{N-1}, s_0) = Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}), \tag{227}$$

which implies,

$$P(s_0, x^N, a_e^N, a_d^N, y^N) = P(s_0)\, Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N, s_0). \tag{228}$$

Summing over $s_0$,

$$P(x^N, a_e^N, a_d^N, y^N) = \sum_{s_0} P(s_0, x^N, a_e^N, a_d^N, y^N) \tag{229}$$
$$= Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1}) \sum_{s_0} P(s_0)\, P(y^N \| x^N, s_0) \tag{230}$$
$$= Q(x^N, a_e^N \| z^{N-1})\, Q(a_d^N \| y^{N-1})\, P(y^N \| x^N),$$

where the last equality follows from Eq. (23).

Appendix D
Derivation of BAA-Action



Here we will derive the BAA-Action algorithm (Algorithm 1) and also state the convergence results. We will state lemmas
similar to those in [43]. Some proofs are omitted as they follow verbatim from the corresponding lemmas in [43].

Lemma 6: For a fixed $\lambda \ge 0$, $I(r, q, \lambda)$ is concave, continuous, and has continuous partial derivatives in $r$ and $q$.

Proof: $I(r, q)$ has continuous partial derivatives in $r$ and $q$ (follow the proof of Lemma 2 in [43] with $r(x^N \| y^{N-1})$
replaced by $r(x^N, a^N \| z^{N-1})$ and $q(x^N \mid y^N)$ replaced by $q(x^N, a^N \mid y^N)$). $E[\Lambda(A^N)]$ depends only on $r$ and is linear in $r$. ■

Lemma 7: For a fixed $r$, $q^* = \arg\max_{q} I(r, q, \lambda)$, where

$$q^*(x^N, a^N \mid y^N) = \frac{r(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)}{\sum_{x^N, a^N} r(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)}. \tag{231}$$

Proof: Note that for a fixed $r$, $\lambda E[\Lambda(A^N)]$ is fixed. It is easy to prove, as in Lemma 4 in [43], that $I(r, q^*) - I(r, q) \ge 0$
for all $q$. ■
From Lemma 7 the following corollary is immediate.
Corollary 2:

$$\max_{r} I(r, q^*, \lambda) = \max_{r} \max_{q} I(r, q, \lambda). \tag{232}$$
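The claim of Lemma 7 can be illustrated numerically in the simplest setting. The sketch below assumes block length 1 with no actions or feedback, so that $I(r, q) = \sum_{x,y} r(x) p(y|x) \log \frac{q(x|y)}{r(x)}$ and $q^*$ reduces to the usual Bayes posterior. The helper names `objective` and `posterior`, the channel values, and the input distribution are hypothetical choices for the example.

```python
import math
import random

def objective(r, p, q):
    """I(r, q) in bits, with q[x][y] read as q(x|y)."""
    return sum(r[x] * p[x][y] * math.log2(q[x][y] / r[x])
               for x in range(len(r)) for y in range(len(p[0]))
               if r[x] * p[x][y] > 0)

def posterior(r, p):
    """q*(x|y) = r(x) p(y|x) / sum_x' r(x') p(y|x')."""
    nx, ny = len(r), len(p[0])
    py = [sum(r[x] * p[x][y] for x in range(nx)) for y in range(ny)]
    return [[r[x] * p[x][y] / py[y] for y in range(ny)] for x in range(nx)]

r = [0.3, 0.7]
p = [[0.9, 0.1], [0.2, 0.8]]
q_star = posterior(r, p)
best = objective(r, p, q_star)

# Compare against randomly drawn conditional distributions q(.|y).
random.seed(1)
worst_gap = float("inf")
for _ in range(200):
    a = [random.uniform(0.01, 0.99) for _ in range(2)]
    q = [[a[0], a[1]], [1 - a[0], 1 - a[1]]]
    worst_gap = min(worst_gap, best - objective(r, p, q))
print(best, worst_gap)
```

In this toy case the gap $I(r, q^*) - I(r, q)$ equals an averaged Kullback-Leibler divergence between $q^*(\cdot|y)$ and $q(\cdot|y)$, so it is never negative, and `best` coincides with the mutual information between input and output.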



Let the set $\mathcal{A}_{i,z} = \{y^{i-1} : f(a^{i-1}, y^{i-1}) = z^{i-1}\}$, for all $i \in [1 : N]$. After proving the update of $q$ as in Lemma 7 above, the
main theorem in the derivation of the algorithm is the update rule for $r$, which is as follows.
Theorem 17: Fix $\lambda \ge 0$ and $q = q(x^N, a^N \mid y^N)$; then $r^* = \arg\max_{r} I(r, q, \lambda)$, where

$$r^*(x^N, a^N \| z^{N-1}) \tag{233}$$
$$= \prod_{i=1}^{N} r(x_i, a_i \mid x^{i-1}, a^{i-1}, z^{i-1}) \tag{234}$$

where

$$r(x_i, a_i \mid x^{i-1}, a^{i-1}, z^{i-1}) = \frac{r'(x^i, a^i, z^{i-1})}{\sum_{x_i, a_i} r'(x^i, a^i, z^{i-1})} \tag{235}$$

$$r'(x^i, a^i, z^{i-1}) = \prod_{y^{i-1} \in \mathcal{A}_{i,z}}\ \prod_{x_{i+1}^N, a_{i+1}^N, y_i^N} \left[ \frac{q(x^N, a^N \mid y^N)\, 2^{-N\lambda\Lambda(a^N)}}{\prod_{j=i+1}^{N} r(x_j, a_j \mid x^{j-1}, a^{j-1}, z^{j-1})} \right]^{p(y^N \| x^N) \prod_{j=i+1}^{N} r(x_j, a_j \mid x^{j-1}, a^{j-1}, z^{j-1})} \tag{236}$$



Proof: Here we fix $q$ and try to find $r = \prod_{i=1}^{N} r_i$ (denote $r_i \triangleq r(x_i, a_i \mid x^{i-1}, a^{i-1}, z^{i-1})$ and $p_i \triangleq p(y_i \mid x^i, y^{i-1})$,
for all $i \in [1 : N]$) that maximizes the expression $I(r, q) - \lambda E[\Lambda(A^N)]$. Note that, as there is a one-to-one correspondence between $r$
and the factors $\{r_i\}_{i=1}^{N}$, maximizing $I(r, q, \lambda)$ over $r$ is equivalent to maximizing over the factors $\{r_i\}_{i=1}^{N}$. As already
proved, $I(r, q, \lambda)$ is concave in $r$, and thus concave in $r_i$ if all the other factors are kept constant. Since the constraints are linear,
i.e., $\sum_{x_i, a_i} r_i = 1$, it follows from concavity that we can use the Lagrange multiplier method and the Karush-Kuhn-Tucker conditions to
perform this maximization from $i = N$ to $i = 1$, maximizing the Lagrangian (described shortly) in a single factor at a
time. The proof has similar steps as in Appendix A in [43], except there is a different Lagrangian,



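$$J = \frac{1}{N} \sum_{x^N, a^N, y^N} \Big( p(y^N \| x^N) \prod_{j=1}^{N} r_j \Big) \log \frac{q(x^N, a^N \mid y^N)\, 2^{-N\lambda\Lambda(a^N)}}{\prod_{j=1}^{N} r_j} \tag{237}$$
$$+ \sum_{i=1}^{N} \sum_{x^{i-1}, a^{i-1}, z^{i-1}} \nu_{i, x^{i-1}, a^{i-1}, z^{i-1}} \Big( \sum_{x_i, a_i} r_i - 1 \Big) \tag{238}$$

where $\nu_{i, x^{i-1}, a^{i-1}, z^{i-1}}$ are Lagrangian multipliers. Hence for every $i \in \{1, \cdots, N\}$ we have,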



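$$\frac{\partial J}{\partial r_i} = \frac{1}{N} \sum_{x_{i+1}^N, a_{i+1}^N, y_i^N}\ \sum_{y^{i-1} \in \mathcal{A}_{i,z}} \Big( p(y^N \| x^N) \prod_{j=1, j \ne i}^{N} r_j \Big) \Big[ \log \frac{q(x^N, a^N \mid y^N)\, 2^{-N\lambda\Lambda(a^N)}}{\prod_{j=1}^{N} r_j} - 1 \Big] + \nu_{i, x^{i-1}, a^{i-1}, z^{i-1}} \tag{240}$$

$$= \frac{1}{N} \sum_{x_{i+1}^N, a_{i+1}^N, y_i^N}\ \sum_{y^{i-1} \in \mathcal{A}_{i,z}} \Big( p(y^N \| x^N) \prod_{j=1, j \ne i}^{N} r_j \Big) \Big[ \log \frac{q(x^N, a^N \mid y^N)\, 2^{-N\lambda\Lambda(a^N)}}{\prod_{j=i+1}^{N} r_j} - \log\Big(\prod_{j=1}^{i-1} r_j\Big) - \log(r_i) - 1 \Big] + \nu_{i, x^{i-1}, a^{i-1}, z^{i-1}} = 0. \tag{241}$$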



As, for a given $r_i$, the arguments $x^{i-1}, a^{i-1}, z^{i-1}$ are constants, the term $\prod_{j=1}^{i-1} r_j$ is constant and independent of $\mathcal{A}_{i,z}$, and hence can be
taken out of the summation; we can then divide the whole equation by it to get a new $\nu^*(i, x^{i-1}, a^{i-1}, z^{i-1})$, since $\nu(i, \cdot)$ is a function
of $(x^{i-1}, a^{i-1}, z^{i-1})$. Also, the other three terms ($\log(\prod_{j=1}^{i-1} r_j)$, $\log r_i$, $1$) are constants, with their coefficient being,



$$\sum_{x_{i+1}^N, a_{i+1}^N, y_i^N}\ \sum_{y^{i-1} \in \mathcal{A}_{i,z}} p(y^N \| x^N) \prod_{j=1, j \ne i}^{N} r_j. \tag{242}$$



The rest of the proof in Appendix A of the referenced paper follows verbatim, with $q(\cdot)$ inside the logarithm being replaced by
$q(\cdot)\, 2^{-N\lambda\Lambda(a^N)}$. ■

Thus, following the above lemmas, we have a similar Blahut-Arimoto algorithm of alternating maximization. The above lemmas,
similarly as in [43], provide natural lower bounds $I_L^{(k)}(\lambda)$ indexed by $k$, the number of iterations, and $\lambda$, the Lagrangian multiplier,
given by,

$$I_L^{(k)}(\lambda) = \frac{1}{N} \sum_{x^N, a^N, y^N} r^{(k)}(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{q^{(k)}(x^N, a^N \mid y^N)}{r^{(k)}(x^N, a^N \| z^{N-1})}. \tag{243}$$

Due to Lemma 6 above, which states that $I(r, q, \lambda)$ is concave, continuous, and has continuous partial derivatives in $r$ and $q$, the
alternating maximization procedure converges (cf. proof of Lemma 1 in [43]).



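$$I_L^{(k)}(\lambda) \uparrow \max_{r} \frac{1}{N} I(X^N \to Y^N) - \lambda E[\Lambda(A^N)] \triangleq C_N^{\lambda}, \tag{244}$$

where $\uparrow$ implies convergence from below as $k \to \infty$. Similarly there exist upper bounds,

$$I_U^{(k)}(\lambda) = \frac{1}{N} \max_{x_1, a_1} \sum_{y_1} \max_{x_2, a_2} \cdots \max_{x_N, a_N} \sum_{y_N} p(y^N \| x^N) \log \frac{p(y^N \| x^N)\, 2^{-N\lambda\Lambda(a^N)}}{\sum_{x^N, a^N} p(y^N \| x^N)\, r^{(k)}(x^N, a^N \| z^{N-1})}. \tag{245}$$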



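To describe the mentioned nature of the upper bound $I_U^{(k)}(\lambda)$, we have the following lemmas corresponding to those in [43] (most
proofs are verbatim and hence omitted or described briefly).

Lemma 8: Let $I_{r_1}(X^N \to Y^N)$ correspond to $r_1(x^N, a^N \| z^{N-1})$ and $E_{r_1}[\Lambda(A^N)]$ be the corresponding cost incurred; then
for any $r_0(x^N, a^N \| z^{N-1})$ and $\lambda \ge 0$,

$$\frac{1}{N} I_{r_1}(X^N \to Y^N) - \lambda E_{r_1}[\Lambda(A^N)] \le \frac{1}{N} \sum_{x^N, a^N, y^N} r_1(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{p(y^N \| x^N)\, 2^{-N\lambda\Lambda(a^N)}}{\sum_{x^N, a^N} r_0(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)}. \tag{246}$$

Proof: Consider,

$$\frac{1}{N} \sum_{x^N, a^N, y^N} r_1(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{p(y^N \| x^N)\, 2^{-N\lambda\Lambda(a^N)}}{\sum_{x^N, a^N} r_0(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)} - \frac{1}{N} I_{r_1}(X^N \to Y^N) + \lambda E_{r_1}[\Lambda(A^N)] \tag{247}$$

$$= \frac{1}{N} \sum_{x^N, a^N, y^N} r_1(x^N, a^N \| z^{N-1})\, p(y^N \| x^N) \log \frac{\sum_{x^N, a^N} r_1(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)}{\sum_{x^N, a^N} r_0(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)} \tag{248}$$

$$= \frac{1}{N} D\big(p_1(y^N) \,\|\, p_0(y^N)\big) \ge 0. \tag{249}$$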



Lemma 9: For every $\lambda \ge 0$,

$$C_N^{\lambda} \le \frac{1}{N} \min_{r} \max_{x_1, a_1} \sum_{y_1} \max_{x_2, a_2} \cdots \max_{x_N, a_N} \sum_{y_N} p(y^N \| x^N) \log \frac{p(y^N \| x^N)\, 2^{-N\lambda\Lambda(a^N)}}{\sum_{x^N, a^N} p(y^N \| x^N)\, r(x^N, a^N \| z^{N-1})}. \tag{250}$$

Proof: The proof follows verbatim from the proof of Lemma 9 in [43]. ■

Lemma 10: The upper bound in Lemma 9 is tight and is attained by the $r(x^N, a^N \| z^{N-1})$ that achieves the capacity.

Proof: The proof follows along the lines of the proof of Lemma 10 in [43], except that there is an additional term $-\lambda E[\Lambda(A^N)]$
in the Lagrangian, which accounts for the term $2^{-N\lambda\Lambda(a^N)}$ on the right-hand side of the expression in Lemma 9. ■

Thus we have $I_U^{(k)}(\lambda) \downarrow C_N^{\lambda}$ as $k \to \infty$, for all $\lambda \ge 0$, where the down arrow $\downarrow$ implies convergence from above. Since the
Lagrangian multiplier $\lambda$ characterizes the tradeoff, the corresponding point on the tradeoff curve is $(C_N^{\lambda}, \Gamma^{(\lambda)})$, where

$$\Gamma^{(\lambda)} = \sum_{x^N, a^N, y^N} r^*(x^N, a^N \| z^{N-1})\, p(y^N \| x^N)\, \Lambda(a^N). \tag{251}$$



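Appendix E
Proof of Theorem 16

For convenience, instead of $N$, consider the block length to be $B$, divided into $M$ sub-blocks each of length $N$. Define,

$$C_B(\Gamma) = \max \frac{1}{B} I(X^B \to Y^B, S^B \mid S_0 = 0) \tag{252}$$

where the max is over the appropriate joint distributions that satisfy the cost constraint. Now we have,

$$C_B(\Gamma) = \max \frac{1}{B} \sum_{i=1}^{B} I(X^i; Y_i, S_i \mid Y^{i-1}, S^{i-1}, S_0 = 0) \tag{253}$$

$$= \max \frac{1}{B} \sum_{i=1}^{B} H(Y_i, S_i \mid Y^{i-1}, S^{i-1}, S_0 = 0) - H(Y_i, S_i \mid X^i, Y^{i-1}, S^{i-1}, S_0 = 0) \tag{254}$$

$$\stackrel{(a)}{=} \max \frac{1}{B} \sum_{i=1}^{B} H(Y_i, S_i \mid Y^{i-1}, S^{i-1}, S_0 = 0) - H(Y_i, S_i \mid X_i, S_{i-1}) \tag{255}$$

$$= \max \frac{1}{B} \sum_{j=1}^{M} \sum_{i=N(j-1)+1}^{Nj} H(Y_i, S_i \mid Y^{i-1}, S^{i-1}, S_0 = 0) - H(Y_i, S_i \mid X_i, S_{i-1}) \tag{256}$$

$$\le \max \frac{1}{B} \sum_{j=1}^{M} \sum_{i=N(j-1)+1}^{Nj} H(Y_i, S_i \mid Y_{N(j-1)+1}^{i-1}, S_{N(j-1)+1}^{i-1}, S_{N(j-1)}) - H(Y_i, S_i \mid X_i, S_{i-1}) \tag{257}$$

$$\le \max \frac{1}{B} \sum_{j=1}^{M} I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)}) \tag{258}$$

$$\le \frac{1}{B} \sum_{j=1}^{M} \max I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)}) \tag{259}$$

$$= \frac{1}{B} \sum_{j=1}^{M} \max \Big( \sum_{s \in \mathcal{S}} P(S_{N(j-1)} = s)\, I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)} = s) \Big) \tag{260}$$

$$\stackrel{(b)}{\le} \frac{1}{B} \sum_{j=1}^{M} \Big( \sum_{s \in \mathcal{S}} P(S_{N(j-1)} = s) \max I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)} = s) \Big) \tag{261}$$

$$\stackrel{(c)}{=} \frac{1}{B} \sum_{j=1}^{M} \max I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)} = 0) \tag{262}$$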

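where (a) follows from the channel definition, (b) follows from the fact that $\max(w_1 f(x) + w_2 g(x)) \le w_1 \max f(x) + w_2 \max g(x)$,
and (c) follows from the symmetry of the channel structure, with $\mathcal{S} = \{0, 1\}$ here. But note that
$\max I(X_{N(j-1)+1}^{Nj} \to Y_{N(j-1)+1}^{Nj}, S_{N(j-1)+1}^{Nj} \mid S_{N(j-1)} = 0) = N C_N(\Gamma_j)$, where $\Gamma_j$ is the cost incurred in the $j$-th block, such
that $\frac{N}{B} \sum_{j=1}^{M} \Gamma_j \le \Gamma$. Thus we have,

$$C_B(\Gamma) \le \frac{N}{B} \sum_{j=1}^{M} C_N(\Gamma_j) \tag{263}$$
$$\stackrel{(g)}{\le} C_N\Big(\frac{N}{B} \sum_{j=1}^{M} \Gamma_j\Big) \tag{264}$$
$$\stackrel{(h)}{\le} C_N(\Gamma), \tag{265}$$

where (g) and (h) follow from the concavity and the non-decreasing nature of $C_N(\Gamma)$, respectively. Note that the above holds for any $N$; thus,
with $B \to \infty$, we obtain $C(\Gamma) \le C_N(\Gamma)$ for all $N$.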
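Steps (g) and (h) rest on two generic properties: Jensen's inequality for a concave function, and monotonicity in the cost budget. The sketch below checks them numerically for a stand-in function. The choice $C(\gamma) = \log_2(1 + \gamma)$ is purely illustrative (it is concave and non-decreasing, but it is not the $C_N$ of the paper), and the name `jensen_chain` is hypothetical.

```python
import math

def jensen_chain(C, gammas, gamma):
    """Return (average of C(gamma_j), C(average gamma_j), C(gamma)).

    Assumes C is concave and non-decreasing and that the average
    sub-block cost does not exceed the overall budget gamma.
    """
    M = len(gammas)
    avg_cost = sum(gammas) / M
    if avg_cost > gamma + 1e-12:
        raise ValueError("average sub-block cost exceeds the budget")
    avg_value = sum(C(g) for g in gammas) / M
    return avg_value, C(avg_cost), C(gamma)

# Stand-in for C_N: any concave, non-decreasing function of the cost.
stand_in = lambda g: math.log2(1.0 + g)
lhs, mid, rhs = jensen_chain(stand_in, [0.1, 0.5, 0.9], 0.6)
print(lhs, mid, rhs)
```

The three returned values are ordered as in (263) to (265): the average of the per-block values is at most the value at the average cost, which in turn is at most the value at the full budget.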

We will now derive the lower bound. Here also assume the block length is $B$, which is divided into $M$ sub-blocks of length
$N$. The following achievability scheme is used sub-block by sub-block. In the last time epoch of each sub-block of length $N$, an
action is taken to observe the feedback. This feedback provides the initial state for the $N$ channel uses in the next
sub-block, where the encoder encodes to achieve $C_N(\Gamma)$. Thus the total incurred cost is at most $\frac{M}{B}(N\Gamma + 1) = \Gamma + \frac{1}{N}$. Thus we have
$C_N(\Gamma) \le C(\Gamma + \frac{1}{N})$, or $C_N(\Gamma - \frac{1}{N}) \le C(\Gamma)$ for $\Gamma \in [\frac{1}{N}, 1]$.
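The cost accounting behind the shift by $1/N$ in the lower bound can be sketched with exact rational arithmetic. The sketch assumes that each of the $M$ sub-blocks spends at most $N\Gamma$ on the actions of its own code plus one additional unit-cost feedback action, and that $B = MN$. The function name `average_cost` is hypothetical.

```python
from fractions import Fraction

def average_cost(N, M, gamma):
    """Average per-symbol action cost over B = M * N channel uses.

    Assumption: each sub-block pays at most N * gamma for its own actions
    plus one unit-cost action that reveals the state to the encoder.
    """
    B = M * N
    return Fraction(M) * (N * Fraction(gamma) + 1) / B

gamma = Fraction(1, 5)
for N in (2, 3, 10):
    # The overhead 1/N does not depend on the number of sub-blocks M.
    print(N, average_cost(N, 7, gamma), gamma + Fraction(1, N))
```

Under these assumptions the average cost equals $\Gamma + 1/N$ for every $M$, so the overhead of the extra feedback action vanishes as the sub-block length grows.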



